We address text-to-3D creation with pre-trained latent-based NeRFs, i.e., NeRFs that generate 3D objects from an input latent code. Recent works such as DreamFusion and Magic3D have shown great success in generating 3D content from text prompts using NeRFs, but optimizing a NeRF for every text prompt is 1) extremely time-consuming and 2) often yields low-resolution outputs. To address these challenges, we propose 3D-CLFusion, a novel method that leverages pre-trained latent-based NeRFs and performs 3D content creation in less than a minute. In particular, we introduce a latent diffusion prior network that learns the w latent from input CLIP text/image embeddings. This pipeline produces the w latent without further optimization at inference time, and the pre-trained NeRF then performs multi-view, high-resolution 3D synthesis conditioned on that latent. The novelty of our model lies in introducing contrastive learning when training the diffusion prior, which enables the generation of valid view-invariant latent codes. Experiments demonstrate the effectiveness of the proposed view-invariant diffusion process for fast text-to-3D creation, e.g., 100 times faster than DreamFusion. Our model can serve as a plug-and-play tool for text-to-3D with pre-trained NeRFs.
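To make the role of contrastive learning in the diffusion prior more concrete, the following is a minimal sketch of one possible view-invariance objective. The abstract does not specify the form of the loss, so the InfoNCE-style formulation here is an assumption, as are the names `cosine`, `info_nce`, `object_ids`, and `temperature`. The idea illustrated is only that latents predicted from different views of the same object should be pulled together, while latents of different objects are pushed apart.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(latents, object_ids, temperature=0.1):
    """InfoNCE-style loss over predicted latents (hypothetical formulation).

    latents[i]    : latent predicted from one rendered view (list of floats)
    object_ids[i] : identifier of the object that view belongs to
    Views sharing an object id are treated as positives, all others as negatives.
    """
    n = len(latents)
    losses = []
    for i in range(n):
        # Exponentiated, temperature-scaled similarity to every other latent.
        sims = {
            j: math.exp(cosine(latents[i], latents[j]) / temperature)
            for j in range(n)
            if j != i
        }
        denom = sum(sims.values())
        for j, s in sims.items():
            if object_ids[j] == object_ids[i]:
                losses.append(-math.log(s / denom))
    # No positive pairs in the batch: nothing to contrast.
    return sum(losses) / len(losses) if losses else 0.0


# Two objects, two views each. In the first batch the predicted latents agree
# across views of the same object; in the second they do not.
ids = [0, 0, 1, 1]
view_invariant = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
view_dependent = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

loss_invariant = info_nce(view_invariant, ids)
loss_dependent = info_nce(view_dependent, ids)
print(loss_invariant < loss_dependent)
```

In this toy setting the loss is lower when the predicted latents are consistent across views, which is the behavior a view-invariant prior would be trained toward. How such a term is weighted against the diffusion objective is not stated in the abstract and is left open here.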