Open-domain 3D object synthesis has lagged behind image synthesis due to limited data and higher computational complexity. To bridge this gap, recent works have investigated multi-view diffusion but often fall short in 3D consistency, visual quality, or efficiency. This paper proposes MVEdit, which functions as a 3D counterpart of SDEdit, employing ancestral sampling to jointly denoise multi-view images and output high-quality textured meshes. Built on off-the-shelf 2D diffusion models, MVEdit achieves 3D consistency through a training-free 3D Adapter, which lifts the 2D views of the last timestep into a coherent 3D representation, then conditions the 2D views of the next timestep using rendered views, without compromising visual quality. With an inference time of only 2-5 minutes, this framework achieves a better trade-off between quality and speed than score distillation. MVEdit is highly versatile and extendable, with a wide range of applications including text/image-to-3D generation, 3D-to-3D editing, and high-quality texture synthesis. In particular, evaluations demonstrate state-of-the-art performance in both image-to-3D and text-guided texture generation tasks. Additionally, we introduce a method for fine-tuning 2D latent diffusion models on small 3D datasets with limited resources, enabling fast low-resolution text-to-3D initialization.
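The sampling loop described in the abstract (denoise each view, lift the views into a shared 3D representation, render back, take an ancestral step) can be illustrated with a deliberately small toy. This is a sketch under strong assumptions and not the MVEdit implementation: the "3D representation" is a flat vector of texels, `lift` is a per-texel average, `render` is an index lookup, and `toy_denoise` is an analytic Gaussian denoiser standing in for a 2D diffusion model. All function names are hypothetical. The toy also substitutes the rendered views directly for the denoised estimate, which is a simplification of conditioning the 2D model on rendered views. The ancestral step uses the standard Euler ancestral update.

```python
import numpy as np

def toy_denoise(x, sigma):
    # Stand-in for a 2D diffusion model (assumption): the exact posterior
    # mean E[x0 | x] when x0 ~ N(0, 1) and x = x0 + sigma * noise.
    return x / (1.0 + sigma ** 2)

def lift(view_preds, visible, n_texels):
    # "Lift" per-view estimates into one shared representation by averaging
    # each texel over the views that observe it.
    total = np.zeros(n_texels)
    count = np.zeros(n_texels)
    for pred, idx in zip(view_preds, visible):
        np.add.at(total, idx, pred)
        np.add.at(count, idx, 1.0)
    return total / np.maximum(count, 1.0)

def render(shared, visible):
    # "Render" each view by reading back the texels it observes.
    return [shared[idx] for idx in visible]

def multiview_edit(init_views, visible, n_texels, sigmas, seed=0):
    # sigmas must be strictly decreasing and end at 0.
    rng = np.random.default_rng(seed)
    # SDEdit-style start: perturb the inputs to the first noise level.
    views = [v + sigmas[0] * rng.standard_normal(v.shape) for v in init_views]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Denoise every view independently, then enforce consistency
        # through the shared representation.
        preds = [toy_denoise(v, sigma) for v in views]
        consistent = render(lift(preds, visible, n_texels), visible)
        # Euler ancestral step: move deterministically to sigma_down,
        # then add fresh noise of scale sigma_up.
        sigma_up = np.sqrt(sigma_next**2 * (sigma**2 - sigma_next**2) / sigma**2)
        sigma_down = np.sqrt(sigma_next**2 - sigma_up**2)
        views = [
            v + (v - c) / sigma * (sigma_down - sigma)
            + sigma_up * rng.standard_normal(v.shape)
            for v, c in zip(views, consistent)
        ]
    return views

# Two views of a 6-texel object; texels 2 and 3 are seen by both views.
visible = [np.array([0, 1, 2, 3]), np.array([2, 3, 4, 5])]
init_views = [np.array([1.0, 0.5, -0.5, 0.0]), np.array([0.4, -0.2, 1.0, -1.0])]
sigmas = np.array([1.0, 0.5, 0.2, 0.0])
out = multiview_edit(init_views, visible, 6, sigmas)
```

Because the last noise level is zero, the final step returns the rendered views themselves, so the texels shared by both views agree even though the two inputs disagreed on them.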