Introducing Stable Video 3D: Quality Novel View Synthesis and 3D Generation from Single Images

Key Takeaways:

  • Today we are releasing Stable Video 3D (SV3D), a generative model based on Stable Video Diffusion, advancing the field of 3D technology and delivering greatly improved quality and view-consistency.

  • This release features two variants: SV3D_u and SV3D_p. SV3D_u generates orbital videos based on single image inputs without camera conditioning. SV3D_p extends the capability by accommodating both single images and orbital views, allowing for the creation of 3D video along specified camera paths. 

  • Stable Video 3D can now be used for commercial purposes with a Stability AI Membership. For non-commercial use, you can download the model weights on Hugging Face and view our research paper here.

SV3D takes a single object image as input and outputs novel multi-views of that object. We can then use those novel views and SV3D to generate 3D meshes.

When we released Stable Video Diffusion, we highlighted the versatility of our video model across various applications. Building upon this foundation, we are excited to release Stable Video 3D. This new model advances the field of 3D technology, delivering greatly improved quality and multi-view consistency when compared to the previously released Stable Zero123, as well as outperforming other open-source alternatives such as Zero123-XL.

This release features two variants: 

  • SV3D_u: This variant generates orbital videos based on single image inputs without camera conditioning. 

  • SV3D_p: Extending the capability of SV3D_u, this variant accommodates both single images and orbital views, allowing for the creation of 3D video along specified camera paths. 

Stable Video 3D can now be used for commercial purposes with a Stability AI Membership. For non-commercial use, you can download the model weights on Hugging Face and view our research paper here.

Advantages of Video Diffusion

By adapting our Stable Video Diffusion image-to-video diffusion model with the addition of camera path conditioning, Stable Video 3D is able to generate multi-view videos of an object. The use of video diffusion models, in contrast to image diffusion models as used in Stable Zero123, provides major benefits in generalization and view-consistency of generated outputs. Additionally, we propose improved 3D optimization leveraging this powerful capability of Stable Video 3D to generate arbitrary orbits around an object. By further implementing these techniques with disentangled illumination optimization as well as a new masked score distillation sampling loss function, Stable Video 3D is able to reliably output quality 3D meshes from single image inputs.
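To make the camera-path idea concrete, here is a minimal sketch of what a conditioning orbit could look like: a sequence of (azimuth, elevation) pairs sweeping evenly around the object at a fixed elevation. The frame count, angle convention, and function name are illustrative assumptions for this post, not the model's actual interface.

```python
def orbit_camera_path(n_frames=21, elevation_deg=10.0):
    """Build a simple circular orbit as (azimuth, elevation) pairs in degrees.

    SV3D_p conditions generation on a camera path; this helper only
    illustrates one static-elevation orbit. Frame count and angle
    convention here are assumptions for illustration.
    """
    azimuths = [360.0 * i / n_frames for i in range(n_frames)]
    return [(az, elevation_deg) for az in azimuths]

path = orbit_camera_path()
print(len(path))   # 21 evenly spaced views around the object
print(path[0])     # (0.0, 10.0)
```

An arbitrary orbit would simply vary the elevation (or spacing) per frame instead of holding it fixed.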

See the technical report here for more details on the Stable Video 3D models and experimental comparisons.

Novel-View Generation

Stable Video 3D introduces significant advancements in 3D generation, particularly in novel view synthesis (NVS). Unlike previous approaches that often grapple with limited perspectives and inconsistencies in outputs, Stable Video 3D is able to deliver coherent views from any given angle with proficient generalization. This capability not only enhances pose-controllability, but also ensures consistent object appearance across multiple views, further improving critical aspects of realistic and accurate 3D generations.

Stable Video 3D is able to generate novel multi-views that are more detailed, faithful to the input image, and multi-view consistent compared to existing works.

3D Generation

Stable Video 3D leverages its multi-view consistency to optimize 3D Neural Radiance Fields (NeRF) and mesh representations to improve the quality of 3D meshes generated directly from novel views. For this, we have designed a masked score distillation sampling loss to further enhance 3D quality in regions not visible in the predicted views. Additionally, in order to reduce the issue of baked-in lighting, Stable Video 3D employs a disentangled illumination model that is jointly optimized along with 3D shape and texture.
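The intuition behind masking the score distillation term can be sketched as follows: restrict the distillation gradient to pixels not covered by the predicted novel views, so the diffusion prior only fills in unseen regions while visible regions stay anchored to the predictions. The shapes, names, and the simple residual-style gradient below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def masked_sds_gradient(render, denoised_target, visibility_mask, weight=1.0):
    """Illustrative masked score-distillation-style gradient.

    Plain SDS-style optimization pushes the rendered image toward the
    diffusion model's denoised prediction everywhere; the mask zeroes
    that gradient wherever a predicted novel view already covers the
    pixel. This is a toy sketch, not SV3D's actual loss.
    """
    unseen = 1.0 - visibility_mask  # 1 where no predicted view sees the pixel
    return weight * unseen * (render - denoised_target)

render = np.ones((4, 4))
target = np.zeros((4, 4))
mask = np.zeros((4, 4))
mask[:2] = 1.0  # top half is visible in the predicted views
grad = masked_sds_gradient(render, target, mask)
# gradient is zero on the visible top half, nonzero on the unseen bottom half
```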

Sample 3D mesh generations with our 3D optimization using SV3D model and its outputs.

Example of 3D mesh results obtained using SV3D compared to outputs generated from EscherNet and Stable Zero123.


To stay updated on our progress, follow us on Twitter, Instagram, LinkedIn, and join our Discord Community.
