We propose \textbf{DMV3D}, a novel 3D generation approach that uses a transformer-based large 3D reconstruction model as the denoiser in a multi-view diffusion process. The reconstruction model incorporates a triplane NeRF representation and denoises noisy multi-view images through NeRF reconstruction and rendering, achieving single-stage 3D generation in $\sim$30s on a single A100 GPU. We train \textbf{DMV3D} on large-scale multi-view image datasets of highly diverse objects using only image reconstruction losses, without access to 3D assets. We demonstrate state-of-the-art results on single-image reconstruction, a problem that requires probabilistic modeling of unseen object parts in order to generate diverse reconstructions with sharp textures. We also show high-quality text-to-3D generation results that outperform previous 3D diffusion models. Project website: https://justimyhxu.github.io/projects/dmv3d/
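To make the idea of "denoising by reconstruction" concrete, the following is a minimal sketch of a multi-view diffusion sampling loop in which the denoiser predicts the clean views directly. It is not the DMV3D implementation: the abstract does not specify the sampler or noise schedule, so the deterministic DDIM-style update and the cosine schedule are assumptions, and `toy_reconstruct_and_render` is a hypothetical placeholder for the transformer that would build a triplane NeRF from the noisy views and render it back at the same viewpoints.

```python
import numpy as np


def cosine_alpha_bar(t):
    """Cumulative signal level for t in [0, 1] (cosine schedule; an assumption)."""
    return np.cos(0.5 * np.pi * t) ** 2


def toy_reconstruct_and_render(noisy_views, t):
    """Hypothetical stand-in for the reconstruction-based denoiser.

    A real model would map the noisy views to a triplane NeRF and render
    clean images at the input viewpoints. This toy version only averages
    the views, so every "rendered" view comes from one shared estimate.
    """
    shared = noisy_views.mean(axis=0, keepdims=True)
    rendered = np.broadcast_to(shared, noisy_views.shape)
    return np.clip(rendered, -1.0, 1.0).copy()


def sample_multiview(denoiser, num_views=4, size=8, steps=10, seed=0):
    """Deterministic DDIM-style sampling with an x0-predicting denoiser."""
    rng = np.random.default_rng(seed)
    # Start from pure Gaussian noise for every view.
    x = rng.standard_normal((num_views, size, size, 3))
    ts = np.linspace(1.0, 0.0, steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        a_cur, a_next = cosine_alpha_bar(t_cur), cosine_alpha_bar(t_next)
        # The denoiser outputs clean views (x0 prediction), not noise.
        x0_pred = denoiser(x, t_cur)
        # Noise implied by the current sample and the x0 prediction.
        eps = (x - np.sqrt(a_cur) * x0_pred) / np.sqrt(max(1.0 - a_cur, 1e-8))
        # Move to the next (lower) noise level.
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
    return x


views = sample_multiview(toy_reconstruct_and_render)
print(views.shape)
```

Because the final step has no remaining noise, the returned views are exactly the last output of the denoiser. With a reconstruction-based denoiser, this means the generated images are renderings of a single 3D representation, which is what lets one sampling pass yield a 3D asset rather than a set of independent images.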