Text-to-image diffusion models pre-trained on billions of image-text pairs have recently enabled text-to-3D content creation by optimizing a randomly initialized Neural Radiance Field (NeRF) with score distillation. However, the resulting 3D models exhibit two limitations: (a) quality concerns such as saturated colors and the Janus problem; (b) extremely low diversity compared to text-guided image synthesis. In this paper, we show that the conflict between the NeRF optimization process and uniform timestep sampling in score distillation is the main reason for these limitations. To resolve this conflict, we propose to prioritize timestep sampling with monotonically non-increasing functions, which aligns NeRF optimization with the sampling process of the diffusion model. Extensive experiments show that our simple redesign significantly improves text-to-3D content creation, yielding higher quality and diversity.
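To illustrate the contrast between uniform timestep sampling and a monotonically non-increasing schedule, the sketch below compares the two. It is only a minimal illustration: the abstract does not specify the functional form, so the linear decay, the function names, and the bounds `t_min=20` and `t_max=980` are all assumptions made for this example.

```python
import random


def uniform_timestep(t_min: int, t_max: int, rng: random.Random) -> int:
    """Baseline: draw a diffusion timestep uniformly, ignoring training progress."""
    return rng.randint(t_min, t_max)


def non_increasing_timestep(step: int, total_steps: int,
                            t_min: int = 20, t_max: int = 980) -> int:
    """Hypothetical schedule: map the optimization step to a timestep that
    decays linearly from t_max to t_min (an assumed form, not the paper's)."""
    if total_steps < 2:
        return t_max
    # Fraction of optimization completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return round(t_max - frac * (t_max - t_min))


total = 10
schedule = [non_increasing_timestep(i, total) for i in range(total)]
rng = random.Random(0)
baseline = [uniform_timestep(20, 980, rng) for _ in range(total)]
print("non-increasing:", schedule)
print("uniform:       ", baseline)
```

Under this assumed schedule, early NeRF updates are guided by high-noise timesteps and later updates by low-noise ones, mirroring the coarse-to-fine order of the diffusion sampling process, whereas the uniform baseline mixes all noise levels at every stage of optimization.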