Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability

From arXiv. Project page: https://argus-3d.github.io/. Datasets: https://huggingface.co/datasets/BAAI/Objaverse-MIX. arXiv admin note: substantial text overlap with arXiv:2303.14700

Auto-regressive models have achieved impressive results in 2D image generation by modeling joint distributions in grid space. In this paper, we extend auto-regressive models to 3D domains, and seek a stronger ability of 3D shape generation by improving auto-regressive models at capacity and scalability simultaneously. Firstly, we leverage an ensemble of publicly available 3D datasets to facilitate the training of large-scale models. It consists of a comprehensive collection of approximately 900,000 objects, with multiple properties of meshes, points, voxels, rendered images, and text captions. This diverse labeled dataset, termed Objaverse-Mix, empowers our model to learn from a wide range of object variations. However, directly applying 3D auto-regression encounters critical challenges of high computational demands on volumetric grids and ambiguous auto-regressive order along grid dimensions, resulting in inferior quality of 3D shapes. To this end, we then present a novel framework Argus3D in terms of capacity. Concretely, our approach introduces discrete representation learning based on a latent vector instead of volumetric grids, which not only reduces computational costs but also preserves essential geometric details by learning the joint distributions in a more tractable order. The capacity of conditional generation can thus be realized by simply concatenating various conditioning inputs to the latent vector, such as point clouds, categories, images, and texts. In addition, thanks to the simplicity of our model architecture, we naturally scale up our approach to a larger model with an impressive 3.6 billion parameters, further enhancing the quality of versatile 3D generation. Extensive experiments on four generation tasks demonstrate that Argus3D can synthesize diverse and faithful shapes across multiple categories, achieving remarkable performance.
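
The abstract describes two mechanisms only in words: learning discrete tokens from a latent vector, and conditioning by concatenating condition inputs to that vector. The following is a minimal sketch of the general idea, not the Argus3D implementation. It assumes a nearest-neighbor codebook lookup for the discretization and integer condition tokens placed in front of the shape tokens. The function names, the toy codebook, and the token id 7 are all hypothetical.

```python
def quantize(latent, codebook):
    """Map each latent feature vector to the index of its nearest codebook entry."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(codebook)), key=lambda k: sqdist(vec, codebook[k]))
        for vec in latent
    ]


def build_sequence(condition_tokens, shape_tokens):
    """Concatenate condition tokens in front of the shape tokens (assumed layout)."""
    return list(condition_tokens) + list(shape_tokens)


def next_token_pairs(sequence, n_condition):
    """Yield (prefix, target) pairs for the shape tokens only.

    Each shape token is predicted from the condition tokens plus all
    earlier shape tokens, which is the auto-regressive factorization.
    """
    for i in range(n_condition, len(sequence)):
        yield sequence[:i], sequence[i]


# Toy data: a 4-entry codebook and a latent "vector" of three 2-D features.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latent = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]]

shape_tokens = quantize(latent, codebook)
condition_tokens = [7]  # hypothetical category token from a separate vocabulary
sequence = build_sequence(condition_tokens, shape_tokens)
pairs = list(next_token_pairs(sequence, len(condition_tokens)))
```

Because the sequence is one-dimensional, the prediction order is simply left to right, which avoids having to pick an ordering over the three axes of a volumetric grid. A real model would use learned embeddings for the condition inputs rather than integer ids.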

