As a promising 3D generation technique, multiview diffusion (MVD) has attracted considerable attention for its generalizability, quality, and efficiency. By finetuning large pretrained image diffusion models on 3D data, MVD methods first generate multiple views of a 3D object from an image or text prompt and then recover the 3D shape through multiview 3D reconstruction. However, the sparse views and inconsistent details in the generated images make 3D reconstruction challenging. We present MVD$^2$, an efficient 3D reconstruction method for MVD images. MVD$^2$ aggregates image features into a 3D feature volume by projection and convolution and then decodes the volumetric features into a 3D mesh. We train MVD$^2$ on 3D shape collections together with MVD images prompted by rendered views of those shapes. To address the discrepancy between the generated multiview images and the ground-truth views of the 3D shapes, we design a simple yet efficient view-dependent training scheme. MVD$^2$ improves the 3D generation quality of MVD, is fast, and is robust across various MVD methods. After training, it decodes a 3D mesh from multiview images within one second. We train MVD$^2$ with Zero-123++ and the ObjectVerse-LVIS 3D dataset and demonstrate its superior performance in generating 3D models from multiview images produced by different MVD methods, using both synthetic and real images as prompts.
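The projection step that lifts per-view image features into a 3D feature volume can be illustrated with a minimal sketch. This is not the authors' implementation: the orthographic cameras, nearest-neighbour sampling, mean aggregation across views, and all function names (`project_orthographic`, `sample_nearest`, `build_feature_volume`) are assumptions made for illustration, and the subsequent 3D convolution and mesh decoding are omitted.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
FeatureMap = List[List[List[float]]]  # indexed as [row][col][channel]
Camera = Tuple[Vec3, Vec3]            # (right axis, up axis) of an assumed orthographic view


def project_orthographic(point: Vec3, right: Vec3, up: Vec3) -> Tuple[float, float]:
    """Project a 3D point onto the image plane spanned by the camera's right/up axes."""
    u = sum(p * r for p, r in zip(point, right))
    v = sum(p * q for p, q in zip(point, up))
    return u, v


def sample_nearest(fmap: FeatureMap, u: float, v: float) -> List[float]:
    """Nearest-neighbour lookup; (u, v) in [-1, 1]^2, with v = +1 at the top row."""
    height, width = len(fmap), len(fmap[0])
    col = min(width - 1, max(0, int((u + 1.0) / 2.0 * width)))
    row = min(height - 1, max(0, int((1.0 - v) / 2.0 * height)))
    return fmap[row][col]


def build_feature_volume(
    feature_maps: Sequence[FeatureMap],
    cameras: Sequence[Camera],
    resolution: int,
) -> List[List[List[List[float]]]]:
    """Return volume[i][j][k] = mean over views of the feature each voxel centre projects to."""
    volume = []
    for i in range(resolution):
        plane = []
        for j in range(resolution):
            line = []
            for k in range(resolution):
                # Voxel centre inside the cube [-1, 1]^3.
                center = tuple(-1.0 + (2.0 * idx + 1.0) / resolution for idx in (i, j, k))
                samples = [
                    sample_nearest(fmap, *project_orthographic(center, right, up))
                    for fmap, (right, up) in zip(feature_maps, cameras)
                ]
                channels = len(samples[0])
                line.append(
                    [sum(s[c] for s in samples) / len(samples) for c in range(channels)]
                )
            plane.append(line)
        volume.append(plane)
    return volume


# Toy usage: a front view (looking along z) and a side view (looking along x).
front = [[[1.0], [2.0]], [[3.0], [4.0]]]
side = [[[10.0], [10.0]], [[10.0], [10.0]]]
cams = [((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)), ((0.0, 0.0, 1.0), (0.0, 1.0, 0.0))]
vol = build_feature_volume([front, side], cams, 2)
```

In a learned pipeline, the averaged volume would typically be refined by 3D convolutions before decoding; the plain mean used here only shows how sparse views contribute features to the same voxel.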