Text-to-image diffusion models pre-trained on billions of image-text pairs have recently enabled 3D content creation by optimizing a randomly initialized differentiable 3D representation with score distillation. However, the optimization process suffers from slow convergence, and the resulting 3D models often exhibit two limitations: (a) quality problems such as missing attributes and distorted shape and texture; (b) extremely low diversity compared to text-guided image synthesis. In this paper, we show that the conflict between the 3D optimization process and uniform timestep sampling in score distillation is the main reason for these limitations. To resolve this conflict, we propose to prioritize timestep sampling with monotonically non-increasing functions, which aligns the 3D optimization process with the sampling process of the diffusion model. Extensive experiments show that our simple redesign significantly improves 3D content creation, with faster convergence and better quality and diversity.
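To make the contrast between uniform and prioritized timestep sampling concrete, the following is a minimal sketch, not the method's actual schedule. It assumes a linearly decaying map from optimization step to diffusion timestep, which is only one possible monotonically non-increasing function. The bounds `T_MIN` and `T_MAX` and the function names are hypothetical and chosen for illustration.

```python
import random

# Assumed timestep bounds for illustration only; not taken from the paper.
T_MIN = 20
T_MAX = 980


def uniform_timestep(rng, t_min=T_MIN, t_max=T_MAX):
    """Baseline: draw a timestep uniformly, independent of optimization progress."""
    return rng.randint(t_min, t_max)


def non_increasing_timestep(step, total_steps, t_min=T_MIN, t_max=T_MAX):
    """Map an optimization step to a timestep that never increases with the step.

    A linear decay is used here purely as an example of a monotonically
    non-increasing function: early steps get large (noisy) timesteps,
    late steps get small ones.
    """
    if total_steps < 1:
        raise ValueError("total_steps must be at least 1")
    if total_steps == 1:
        return t_max
    frac = step / (total_steps - 1)  # 0.0 at the first step, 1.0 at the last
    return round(t_max - frac * (t_max - t_min))


# A five-step toy schedule.
schedule = [non_increasing_timestep(i, 5) for i in range(5)]
print(schedule)  # [980, 740, 500, 260, 20]
```

Under this sketch, the timestep is a deterministic function of the optimization step, so early updates are driven by high-noise timesteps and later updates by low-noise ones, mirroring the coarse-to-fine order of the diffusion sampling process. The uniform baseline, by contrast, may request a high-noise timestep late in the optimization.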