Due to the lack of large-scale text-3D correspondence data, recent text-to-3D generation works mainly rely on 2D diffusion models to synthesize 3D data. Since diffusion-based methods typically require significant optimization time for both training and inference, GAN-based models remain desirable for fast 3D generation. In this work, we propose Triplane Attention for text-guided 3D generation (TPA3D), an end-to-end trainable GAN-based deep learning model for fast text-to-3D generation. With only 3D shape data and their rendered 2D images observed during training, our TPA3D is designed to retrieve detailed visual descriptions for synthesizing the corresponding 3D mesh data. This is achieved through the proposed attention mechanisms applied to the extracted sentence-level and word-level text features. In our experiments, we show that TPA3D generates high-quality 3D textured shapes aligned with fine-grained descriptions, while achieving impressive computational efficiency.
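To make the core idea concrete, the following is a minimal sketch of what attention between triplane features and word-level text features might look like. This is not the authors' exact TPA3D architecture; the module name `TextGuidedPlaneAttention` and all hyperparameters (`d_model`, `n_heads`, token counts) are illustrative assumptions, and the sketch only shows cross-attention from the spatial tokens of one triplane onto word-level text embeddings (e.g., from a CLIP-like encoder).

```python
# Minimal illustrative sketch (assumed names and shapes, not the paper's code):
# refine one triplane feature map by cross-attending to word-level text features.
import torch
import torch.nn as nn

class TextGuidedPlaneAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # Cross-attention: plane tokens act as queries, word tokens as keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, plane: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
        # plane: (B, C, H, W) feature map of a single triplane
        # word_feats: (B, L, C) word-level text embeddings
        B, C, H, W = plane.shape
        tokens = plane.flatten(2).transpose(1, 2)        # (B, H*W, C)
        attended, _ = self.cross_attn(query=self.norm(tokens),
                                      key=word_feats, value=word_feats)
        tokens = tokens + attended                       # residual refinement
        return tokens.transpose(1, 2).reshape(B, C, H, W)

# Example: refine a 32x32 plane using 77 word tokens of dimension 256.
plane = torch.randn(2, 256, 32, 32)
words = torch.randn(2, 77, 256)
refined = TextGuidedPlaneAttention()(plane, words)
print(refined.shape)  # torch.Size([2, 256, 32, 32])
```

In a full triplane generator, one would presumably apply such a block to each of the three planes so that fine-grained word-level cues (colors, parts, materials) can modulate the spatial features before mesh extraction; sentence-level features would condition the generator globally.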