By leveraging text-to-image diffusion priors, score distillation can synthesize 3D content without paired text-3D training data. Instead of spending hours on per-prompt online optimization, recent studies have focused on learning a text-to-3D generative network that amortizes many text-3D mappings and can synthesize 3D content in seconds. However, existing score distillation methods are hard to scale to large numbers of text prompts, owing to the difficulty of aligning the pretrained diffusion prior with the distribution of images rendered from diverse prompts. Current state-of-the-art methods such as Variational Score Distillation finetune the pretrained diffusion model to minimize the noise prediction error and thereby align the distributions; however, they are unstable to train and impair the model's comprehension of numerous text prompts. Based on the observation that diffusion models tend to have lower noise prediction errors at earlier timesteps, we propose Asynchronous Score Distillation (ASD), which minimizes the noise prediction error by shifting the diffusion timestep to earlier ones. ASD is stable to train and scales to 100k prompts. It reduces the noise prediction error without changing the weights of the pretrained diffusion model, thus preserving its strong comprehension of prompts. We conduct extensive experiments across different 2D diffusion models, including Stable Diffusion and MVDream, and text-to-3D generators, including Hyper-iNGP, 3DConv-Net and Triplane-Transformer. The results demonstrate ASD's effectiveness in stable 3D generator training and high-quality 3D content synthesis, and its superior prompt consistency, especially on large prompt corpora.
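To make the timestep-shifting idea concrete, the following is a minimal NumPy sketch of how an ASD-style gradient might be computed. It is a toy illustration under our own assumptions, not the paper's implementation: `eps_pred` is a stand-in for a frozen pretrained noise predictor, the linear alpha-bar schedule is chosen for simplicity, and the names `add_noise`, `asd_gradient`, and the shift parameter `delta_t` are hypothetical. The key contrast with Variational Score Distillation is that the second score term is evaluated by the same frozen model at an asynchronously shifted timestep, rather than by a finetuned copy.

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_pred(x_noisy, t):
    # Stand-in for a frozen pretrained noise predictor epsilon_phi.
    # A real setup would call a diffusion U-Net (e.g. Stable Diffusion)
    # conditioned on the text prompt; this deterministic stub only
    # serves to make the sketch runnable.
    return 0.9 * (x_noisy - x_noisy.mean())

def add_noise(x0, eps, t, T=1000):
    # Standard DDPM-style forward process,
    #   x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps,
    # with a simple linear alpha-bar schedule for illustration.
    a_bar = 1.0 - t / T
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

def asd_gradient(x0, t, delta_t, T=1000):
    # ASD sketch: the two noise-prediction terms are evaluated at
    # asynchronous timesteps t and t + delta_t (the shift exploits the
    # observation that prediction error varies with the timestep),
    # instead of finetuning the model as in Variational Score Distillation.
    eps = rng.standard_normal(x0.shape)
    x_t = add_noise(x0, eps, t, T)
    t_shift = int(np.clip(t + delta_t, 0, T - 1))
    x_ts = add_noise(x0, eps, t_shift, T)
    # Gradient w.r.t. the rendered image: difference of the frozen
    # model's noise predictions at the two timesteps.
    return eps_pred(x_t, t) - eps_pred(x_ts, t_shift)

# Toy usage: a 4x4 "rendered image" at timestep 300 with a shift of 100.
g = asd_gradient(np.zeros((4, 4)), t=300, delta_t=100)
```

In a real amortized text-to-3D pipeline, `g` would be backpropagated through the differentiable renderer into the 3D generator's parameters; because only frozen noise predictions are involved, the diffusion model's weights, and hence its prompt comprehension, are untouched.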