Automatic text-to-3D synthesis has advanced remarkably through the optimization of 3D models. Existing methods commonly rely on pre-trained text-to-image generative models, such as diffusion models, which score 2D renderings of Neural Radiance Fields (NeRFs) and thereby guide NeRF optimization. However, these methods often produce artifacts and inconsistencies across multiple views because of their limited understanding of 3D geometry. To address these limitations, we reformulate the optimization loss using the diffusion prior. Furthermore, we introduce a novel training approach that unlocks the potential of the diffusion prior. To improve the 3D geometry representation, we apply auxiliary depth supervision to NeRF-rendered images and regularize the NeRF density field. Extensive experiments demonstrate that our method outperforms prior work, yielding better photo-realism and improved multi-view consistency.