Recent CLIP-guided 3D optimization methods, such as DreamFields and PureCLIPNeRF, have achieved impressive results in zero-shot text-to-3D synthesis. However, because they train from scratch with random initialization and no prior knowledge, these methods often fail to generate accurate and faithful 3D structures that conform to the input text. In this paper, we make the first attempt to introduce explicit 3D shape priors into the CLIP-guided 3D optimization process. Specifically, in a text-to-shape stage we first generate a high-quality 3D shape from the input text to serve as a 3D shape prior. We then use this prior to initialize a neural radiance field, which we optimize with the full prompt. To address the challenging text-to-shape generation task, we present a simple yet effective approach that directly bridges the text and image modalities with a powerful text-to-image diffusion model. To narrow the style domain gap between the images synthesized by the text-to-image diffusion model and the shape renderings used to train the image-to-shape generator, we further propose to jointly optimize a learnable text prompt and fine-tune the text-to-image diffusion model for rendering-style image generation. Our method, Dream3D, is capable of generating imaginative 3D content with superior visual quality and shape accuracy compared to state-of-the-art methods.
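The data flow described in the abstract can be sketched as a composition of four stages. This is only an illustrative outline, not the authors' implementation: the names `Stages`, `text_to_image`, `image_to_shape`, `init_nerf`, `clip_optimize`, and `dream3d_sketch` are hypothetical, and each stage is a caller-supplied function rather than a real model. The abstract does not say how the text used for shape generation is derived from the full prompt, so the sketch assumes the two are passed in separately.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Stages:
    """Hypothetical container for the four stages named in the abstract."""
    # Text-to-image diffusion model, assumed fine-tuned for rendering-style output.
    text_to_image: Callable[[str], Any]
    # Image-to-shape generator, assumed trained on shape renderings.
    image_to_shape: Callable[[Any], Any]
    # Builds a neural radiance field initialized from the 3D shape prior.
    init_nerf: Callable[[Any], Any]
    # CLIP-guided optimization of the radiance field against the full prompt.
    clip_optimize: Callable[[Any, str], Any]


def dream3d_sketch(shape_text: str, full_prompt: str, stages: Stages) -> Any:
    """Compose the stages: text -> image -> shape prior -> NeRF -> optimized NeRF."""
    image = stages.text_to_image(shape_text)          # bridge text and image modalities
    shape_prior = stages.image_to_shape(image)        # explicit 3D shape prior
    nerf = stages.init_nerf(shape_prior)              # replaces random initialization
    return stages.clip_optimize(nerf, full_prompt)    # refine with the full prompt


if __name__ == "__main__":
    # Stub stages that only record the order of calls, for illustration.
    stubs = Stages(
        text_to_image=lambda t: f"img({t})",
        image_to_shape=lambda i: f"shape({i})",
        init_nerf=lambda s: f"nerf({s})",
        clip_optimize=lambda n, p: f"opt({n}|{p})",
    )
    print(dream3d_sketch("a car", "a car made of cheese", stubs))
```

Passing the stages in as plain callables keeps the sketch independent of any particular diffusion, shape-generation, or NeRF library.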