Because large-scale paired text-3D data are scarce, recent text-to-3D generation methods mainly rely on 2D diffusion models to synthesize 3D data. Since diffusion-based methods typically require significant optimization time for both training and inference, GAN-based models remain desirable for fast 3D generation. In this work, we propose Triplane Attention for text-guided 3D generation (TPA3D), an end-to-end trainable GAN-based deep learning model for fast text-to-3D generation. With only 3D shape data and their rendered 2D images observed during training, TPA3D is designed to retrieve detailed visual descriptions for synthesizing the corresponding 3D mesh data. This is achieved by the proposed attention mechanisms applied to the extracted sentence-level and word-level text features. Our experiments show that TPA3D generates high-quality textured 3D shapes aligned with fine-grained descriptions while remaining computationally efficient.