Recent works learn 3D representations explicitly under text-3D guidance. However, limited text-3D data restricts the vocabulary scale and text control of generation. Generators can easily collapse to a stereotyped concept for certain text prompts, thus losing their open-vocabulary generation ability. To tackle this issue, we introduce a conditional 3D generative model, TextField3D. Specifically, rather than using text prompts as input directly, we propose injecting dynamic noise into the latent space of the given text prompts, forming Noisy Text Fields (NTFs). In this way, limited 3D data can be mapped to an appropriate range of the textual latent space expanded by NTFs. To this end, we propose an NTFGen module to model general text latent codes in noisy fields. Meanwhile, we propose an NTFBind module to align view-invariant image latent codes to noisy fields, further supporting image-conditional 3D generation. To guide conditional generation in both geometry and texture, we construct multi-modal discrimination with a text-3D discriminator and a text-2.5D discriminator. Compared to previous methods, TextField3D offers three merits: 1) large vocabulary, 2) text consistency, and 3) low latency. Extensive experiments demonstrate that our method shows potential for open-vocabulary 3D generation.
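The noise injection described in the abstract can be illustrated with a minimal sketch. This is not the NTFGen module itself, whose design is not specified here. It assumes, purely for illustration, that "dynamic noise" means Gaussian noise whose scale is re-sampled on every call; the function name `noisy_text_field`, the parameter `max_sigma`, and the 512-dimensional latent are all hypothetical.

```python
import numpy as np


def noisy_text_field(text_latent, rng, max_sigma=0.1):
    """Perturb a text latent code with Gaussian noise of a randomly drawn scale.

    Assumption (not from the paper): the noise scale sigma is sampled
    uniformly from [0, max_sigma] on every call, so the same prompt maps
    to a neighbourhood of latent codes rather than to a single point.
    """
    sigma = rng.uniform(0.0, max_sigma)  # hypothetical "dynamic" scale
    noise = rng.normal(0.0, sigma, size=text_latent.shape)
    return text_latent + noise


# Stand-in for the output of a text encoder: 4 prompts, 512-dim latents.
rng = np.random.default_rng(0)
latent = rng.normal(size=(4, 512))
noisy = noisy_text_field(latent, rng)
```

Under these assumptions, repeated calls with the same `latent` yield different nearby codes, which is one simple way a small set of text-3D pairs could cover a wider region of the text latent space.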