We tackle the task of text-to-3D creation with pre-trained latent-based NeRFs, i.e., NeRFs that generate 3D objects from an input latent code. Recent works such as DreamFusion and Magic3D have shown great success in generating 3D content from text prompts using NeRFs, but their approach of optimizing a NeRF for every text prompt is 1) extremely time-consuming and 2) often leads to low-resolution outputs. To address these challenges, we propose a novel method named 3D-CLFusion, which leverages pre-trained latent-based NeRFs and performs fast 3D content creation in less than a minute. In particular, we introduce a latent diffusion prior network that learns to predict the w latent from input CLIP text/image embeddings. This pipeline allows us to produce the w latent without further optimization during inference, and the pre-trained NeRF is then able to perform multi-view high-resolution 3D synthesis based on that latent. The novelty of our model lies in introducing contrastive learning while training the diffusion prior, which enables the generation of valid view-invariant latent codes. We demonstrate through experiments the effectiveness of the proposed view-invariant diffusion process for fast text-to-3D creation, e.g., 100 times faster than DreamFusion. Our model can also serve as a plug-and-play tool for text-to-3D with pre-trained NeRFs.