Auto-regressive models have achieved impressive results in 2D image generation by modeling joint distributions in grid space. In this paper, we extend auto-regressive models to 3D domains and seek stronger 3D shape generation by improving both their capacity and their scalability. First, we assemble a collection of publicly available 3D datasets to facilitate the training of large-scale models. The collection comprises approximately 900,000 objects and covers multiple properties: meshes, points, voxels, rendered images, and text captions. This diverse labeled dataset, termed Objaverse-Mix, enables our model to learn from a wide range of object variations. However, directly applying auto-regression in 3D faces two critical challenges, the high computational cost of volumetric grids and the ambiguous auto-regressive order along grid dimensions, both of which degrade the quality of the generated 3D shapes. To address the capacity side, we present a novel framework, Argus3D. Concretely, our approach performs discrete representation learning on a latent vector instead of volumetric grids, which not only reduces computational cost but also preserves essential geometric details by learning the joint distributions in a more tractable order. Conditional generation can then be realized by simply concatenating various conditioning inputs, such as point clouds, categories, images, and texts, to the latent vector. In addition, thanks to the simplicity of our model architecture, we scale our approach up to a larger model with 3.6 billion parameters, further enhancing the quality of versatile 3D generation. Extensive experiments on four generation tasks demonstrate that Argus3D can synthesize diverse and faithful shapes across multiple categories, achieving strong performance.
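To make the idea of auto-regression over a discrete latent vector with concatenated conditioning more concrete, the following is a minimal toy sketch, not the Argus3D implementation. Everything in it is an assumption made for illustration: the function names, the scalar codebook, the choice to prepend the condition tokens as a prefix, the shared vocabulary for condition and latent tokens, and in particular `toy_next_token_logits`, which is a hypothetical stand-in for a learned transformer.

```python
import math
import random


def quantize(latent, codebook):
    """Map each latent entry to the index of its nearest codebook value."""
    return [
        min(range(len(codebook)), key=lambda k: (x - codebook[k]) ** 2)
        for x in latent
    ]


def toy_next_token_logits(sequence, vocab_size):
    """Hypothetical stand-in for a learned model: logits derived from the prefix."""
    seed = sum((i + 1) * t for i, t in enumerate(sequence))
    return [math.sin(seed + k) for k in range(vocab_size)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(condition_tokens, length, vocab_size, rng):
    """Sample latent tokens one at a time, each conditioned on the full prefix."""
    sequence = list(condition_tokens)  # assumption: conditioning is prepended
    for _ in range(length):
        probs = softmax(toy_next_token_logits(sequence, vocab_size))
        token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        sequence.append(token)
    return sequence[len(condition_tokens):]


def sequence_log_prob(tokens, condition_tokens, vocab_size):
    """Chain rule: log p(z | c) = sum_i log p(z_i | z_<i, c)."""
    prefix = list(condition_tokens)
    total = 0.0
    for token in tokens:
        probs = softmax(toy_next_token_logits(prefix, vocab_size))
        total += math.log(probs[token])
        prefix.append(token)
    return total


# Usage: discretize a toy latent vector, then sample a conditioned sequence.
indices = quantize([0.9, -0.2, -0.8, 0.4], [-1.0, 0.0, 1.0])  # [2, 1, 0, 1]
sample = generate(condition_tokens=[3, 1], length=8, vocab_size=4,
                  rng=random.Random(0))
```

Because the sequence is one-dimensional, the generation order is simply left to right, which is the kind of tractable ordering a latent vector offers over a volumetric grid. Changing the condition tokens changes every subsequent conditional distribution without altering the sampling loop.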