Automatic 3D generation has recently attracted widespread attention. Recent methods have greatly accelerated generation, but they usually produce less-detailed objects because of limited model capacity or limited 3D data. Motivated by recent advances in video diffusion models, we introduce V3D, which leverages the world-simulation capacity of pre-trained video diffusion models to facilitate 3D generation. To fully unleash the potential of video diffusion to perceive the 3D world, we further introduce a geometrical consistency prior and extend the video diffusion model to a multi-view consistent 3D generator. Benefiting from this, the state-of-the-art video diffusion model can be fine-tuned to generate 360-degree orbit frames surrounding an object given a single image. With our tailored reconstruction pipelines, we can generate high-quality meshes or 3D Gaussians within 3 minutes. Furthermore, our method can be extended to scene-level novel view synthesis, achieving precise control over the camera path with sparse input views. Extensive experiments demonstrate the superior performance of the proposed approach, especially in terms of generation quality and multi-view consistency. Our code is available at https://github.com/heheyas/V3D
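To make "360-degree orbit frames surrounding an object" concrete, the sketch below computes one camera pose per frame on a circular orbit around an object at the origin. It is a minimal illustration under stated assumptions, not V3D's actual camera parameterization: the helper name `orbit_camera_poses`, the frame count, radius, fixed elevation, +z-up world, and OpenGL-style camera convention (camera looks along its local -z axis) are all hypothetical choices made for this example.

```python
import math


def _normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def _cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def orbit_camera_poses(num_frames, radius=2.0, elevation_deg=0.0):
    """Hypothetical helper: evenly spaced look-at cameras on a 360-degree orbit.

    Assumptions (not from V3D): the object sits at the origin, +z is world up,
    and elevation is strictly between -90 and 90 degrees (at +/-90 the look-at
    basis degenerates). Each pose is (position, rotation), where rotation is a
    3x3 camera-to-world matrix whose columns are (right, up, back); the camera
    looks along -back, i.e. toward the origin.
    """
    elev = math.radians(elevation_deg)
    poses = []
    for i in range(num_frames):
        # Evenly spaced azimuths; frame i = num_frames would repeat frame 0.
        azim = 2.0 * math.pi * i / num_frames
        pos = (radius * math.cos(elev) * math.cos(azim),
               radius * math.cos(elev) * math.sin(azim),
               radius * math.sin(elev))
        back = _normalize(pos)                            # origin -> camera
        right = _normalize(_cross((0.0, 0.0, 1.0), back))
        up = _cross(back, right)                          # already unit length
        rot = [[right[r], up[r], back[r]] for r in range(3)]
        poses.append((pos, rot))
    return poses


if __name__ == "__main__":
    for pos, _ in orbit_camera_poses(4):
        print(tuple(round(c, 3) for c in pos))
```

Spacing azimuths by `2π / num_frames` closes the loop without duplicating the first view, which is the usual choice when the frames are treated as a cyclic orbit video.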