Score distillation sampling (SDS) has shown great promise in text-to-3D generation by distilling pretrained large-scale text-to-image diffusion models, but it suffers from over-saturation, over-smoothing, and low diversity. In this work, we propose to model the 3D parameter as a random variable rather than a constant as in SDS, and we present variational score distillation (VSD), a principled particle-based variational framework that explains and addresses these issues in text-to-3D generation. We show that SDS is a special case of VSD and yields poor samples with both small and large classifier-free guidance (CFG) weights. In contrast, VSD works well across a range of CFG weights, much as ancestral sampling from diffusion models does, and it improves both diversity and sample quality at a common CFG weight (i.e., $7.5$). We further present several improvements in the design space of text-to-3D generation, such as the distillation time schedule and density initialization, which are orthogonal to the distillation algorithm yet not well explored. Our overall approach, dubbed ProlificDreamer, generates high-fidelity NeRFs at a high rendering resolution (i.e., $512\times512$) with rich structure and complex effects (e.g., smoke and drops). Further, meshes initialized from the NeRF and fine-tuned by VSD are meticulously detailed and photo-realistic. Project page and code: https://ml.cs.tsinghua.edu.cn/prolificdreamer/
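To make the roles of the CFG weight and the two distillation signals concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes the common CFG convention in which a weight of 1 recovers the conditional prediction, and it treats noise predictions as plain lists of floats. All function and variable names (`cfg_noise`, `sds_direction`, `vsd_direction`, `eps_particles`) are hypothetical and chosen only for illustration; in a real pipeline these quantities would come from a pretrained diffusion model and a learned estimate of the score of the rendered particles, and the resulting direction would be back-propagated through a differentiable renderer.

```python
from typing import List

Vector = List[float]


def cfg_noise(eps_uncond: Vector, eps_cond: Vector, weight: float) -> Vector:
    """Classifier-free guidance: move from the unconditional prediction
    toward (and, for weight > 1, past) the conditional one."""
    return [u + weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def sds_direction(eps_pretrained: Vector, eps_sampled: Vector) -> Vector:
    """SDS-style signal: pretrained prediction minus the sampled Gaussian noise."""
    return [p - e for p, e in zip(eps_pretrained, eps_sampled)]


def vsd_direction(eps_pretrained: Vector, eps_particles: Vector) -> Vector:
    """VSD-style signal: pretrained prediction minus a learned noise prediction
    for the current distribution of rendered images (assumed given here)."""
    return [p - q for p, q in zip(eps_pretrained, eps_particles)]


# Toy numbers standing in for model outputs (assumptions, not real data).
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]
eps_sampled = [0.5, 0.5]

guided = cfg_noise(eps_uncond, eps_cond, weight=7.5)

# If the learned prediction is replaced by the sampled noise itself, the
# VSD-style signal reduces to the SDS-style one in this toy setting.
sds = sds_direction(guided, eps_sampled)
vsd_degenerate = vsd_direction(guided, eps_sampled)
```

In this sketch the two signals differ only in what is subtracted from the guided prediction, which loosely mirrors the statement that SDS is a special case of VSD. The CFG weight is the same one the abstract refers to when it cites $7.5$ as a common setting.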