Large-Vocabulary 3D Diffusion Model with Transformer

Creating diverse, high-quality 3D assets with an automatic generative model is highly desirable. Despite extensive efforts on 3D generation, most existing works focus on generating a single category or a few categories. In this paper, we introduce a diffusion-based feed-forward framework for synthesizing a large number of real-world 3D object categories with a single generative model. Large-vocabulary 3D generation poses three major challenges: a) the need for an expressive yet efficient 3D representation; b) large diversity in geometry and texture across categories; c) the complexity of real-world object appearances. To this end, we propose a novel triplane-based 3D-aware Diffusion model with TransFormer, DiffTF, which addresses these challenges in three ways. 1) For efficiency and robustness, we adopt a revised triplane representation that improves fitting speed and accuracy. 2) To handle the drastic variations in geometry and texture, we regard the features of all 3D objects as a combination of generalized 3D knowledge and specialized 3D features. To extract generalized 3D knowledge from diverse categories, we propose a novel 3D-aware transformer with shared cross-plane attention, which learns the relations across different planes and aggregates the generalized 3D knowledge with the specialized 3D features. 3) In addition, we devise a 3D-aware encoder/decoder to enhance the generalized 3D knowledge in the encoded triplanes for handling categories with complex appearances. Extensive experiments on ShapeNet and OmniObject3D (over 200 diverse real-world categories) demonstrate that a single DiffTF model achieves state-of-the-art large-vocabulary 3D object generation performance with large diversity, rich semantics, and high quality.
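
To make the triplane representation mentioned above more concrete, the following is a minimal sketch of how a generic triplane is commonly queried: a 3D point is projected onto three axis-aligned feature planes, each plane is sampled bilinearly, and the three feature vectors are combined. It illustrates the general idea only and is not the revised triplane of DiffTF. The function names (`bilinear_sample`, `query_triplane`), the plane keys, the coordinate range of [-1, 1], and aggregation by summation are all assumptions made for illustration.

```python
import numpy as np


def bilinear_sample(plane, u, v):
    """Bilinearly sample a (C, H, W) feature plane at (u, v) in [-1, 1].

    Hypothetical helper; assumes H >= 2 and W >= 2.
    """
    _, H, W = plane.shape
    # Map normalized coordinates to continuous pixel coordinates.
    x = np.clip((u + 1.0) * 0.5 * (W - 1), 0.0, W - 1)
    y = np.clip((v + 1.0) * 0.5 * (H - 1), 0.0, H - 1)
    # Top-left corner of the cell containing (x, y).
    x0 = min(int(np.floor(x)), W - 2)
    y0 = min(int(np.floor(y)), H - 2)
    tx, ty = x - x0, y - y0
    # Interpolate along x on the two rows, then along y.
    top = (1.0 - tx) * plane[:, y0, x0] + tx * plane[:, y0, x0 + 1]
    bottom = (1.0 - tx) * plane[:, y0 + 1, x0] + tx * plane[:, y0 + 1, x0 + 1]
    return (1.0 - ty) * top + ty * bottom


def query_triplane(planes, point):
    """Return a (C,) feature for a 3D point in [-1, 1]^3.

    `planes` is a dict with keys 'xy', 'xz', 'yz', each a (C, H, W) array.
    Summation is an assumed aggregation; concatenation is another option.
    """
    x, y, z = point
    f_xy = bilinear_sample(planes["xy"], x, y)
    f_xz = bilinear_sample(planes["xz"], x, z)
    f_yz = bilinear_sample(planes["yz"], y, z)
    return f_xy + f_xz + f_yz


# Usage: random planes with 8 channels at 16x16 resolution.
rng = np.random.default_rng(0)
planes = {k: rng.standard_normal((8, 16, 16)) for k in ("xy", "xz", "yz")}
feature = query_triplane(planes, (0.1, -0.3, 0.7))
print(feature.shape)  # (8,)
```

In a full pipeline, a small decoder network would typically map the returned feature to density and color for rendering; that step is omitted here because the abstract does not specify it.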
