Score distillation sampling (SDS) has shown great promise in text-to-3D generation by distilling pretrained large-scale text-to-image diffusion models, but it suffers from over-saturation, over-smoothing, and low-diversity problems. In this work, we propose to model the 3D parameter as a random variable instead of a constant as in SDS and present variational score distillation (VSD), a principled particle-based variational framework to explain and address the aforementioned issues in text-to-3D generation. We show that SDS is a special case of VSD and leads to poor samples with both small and large classifier-free guidance (CFG) weights. In comparison, VSD works well across a range of CFG weights, as ancestral sampling from diffusion models does, and simultaneously improves diversity and sample quality at a common CFG weight (i.e., $7.5$). We further present various improvements in the design space for text-to-3D, such as the distillation time schedule and density initialization, which are orthogonal to the distillation algorithm yet not well explored. Our overall approach, dubbed ProlificDreamer, can generate high-fidelity NeRFs at a high rendering resolution (i.e., $512\times512$) with rich structure and complex effects (e.g., smoke and drops). Further, meshes initialized from the NeRF and fine-tuned with VSD are meticulously detailed and photo-realistic. Project page: https://ml.cs.tsinghua.edu.cn/prolificdreamer/
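As a rough sketch of the contrast between the two objectives, the parameter updates can be written side by side. The notation here is assumed rather than defined in the abstract: $g(\theta, c)$ renders the 3D parameter $\theta$ from camera pose $c$, $x_t = \alpha_t g(\theta, c) + \sigma_t \epsilon$ is the noised rendering at diffusion time $t$, $y$ is the text prompt, $\omega(t)$ is a time-dependent weight, $\epsilon_{\text{pretrain}}$ is the frozen text-to-image noise predictor, and $\epsilon_\phi$ is an auxiliary noise predictor fitted to renderings of the current particles.

$$
\nabla_\theta \mathcal{L}_{\text{SDS}}(\theta) \approx \mathbb{E}_{t,\epsilon,c}\left[\omega(t)\,\bigl(\epsilon_{\text{pretrain}}(x_t, t, y) - \epsilon\bigr)\,\frac{\partial g(\theta, c)}{\partial \theta}\right]
$$

$$
\nabla_\theta \mathcal{L}_{\text{VSD}}(\theta) \approx \mathbb{E}_{t,\epsilon,c}\left[\omega(t)\,\bigl(\epsilon_{\text{pretrain}}(x_t, t, y) - \epsilon_\phi(x_t, t, c, y)\bigr)\,\frac{\partial g(\theta, c)}{\partial \theta}\right]
$$

Under these assumptions, the only difference is the subtracted term. If the distribution over $\theta$ collapses to a single point, the noised rendering given $c$ is Gaussian around $\alpha_t g(\theta, c)$, so the optimal $\epsilon_\phi$ equals the sampled noise $\epsilon$ and the VSD update reduces to the SDS update. This is one way to read the statement that SDS is a special case of VSD.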