Autoregressive (AR) models have achieved remarkable success in natural language and image generation, but their application to 3D shape modeling remains largely unexplored. Unlike diffusion models, AR models enable more efficient and controllable generation with faster inference, making them especially suitable for data-intensive domains. Traditional AR approaches to 3D generation often rely on ``next-token'' prediction at the voxel or point level. While effective for certain applications, these methods can be restrictive and computationally expensive when dealing with large-scale 3D data. To tackle these challenges, we introduce 3D-WAG, an AR model for 3D implicit distance fields that supports unconditional, class-conditioned, and text-conditioned shape generation. Our key idea is to encode shapes as multi-scale wavelet token maps and use a Transformer to predict the ``next higher-resolution token map'' in an autoregressive manner. By redefining the 3D AR generation task as ``next-scale'' prediction, we reduce the computational cost of generation compared to traditional ``next-token'' prediction models, while preserving essential geometric details of 3D shapes in a more structured and hierarchical manner. We evaluate 3D-WAG through quantitative and qualitative comparisons with state-of-the-art methods on widely used benchmarks. Our results show that 3D-WAG achieves superior performance on key metrics such as Coverage and MMD, generating high-fidelity 3D shapes that closely match the real data distribution.
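To make the contrast between ``next-token'' and ``next-scale'' prediction concrete, the following minimal sketch counts sequential decoding passes under each scheme. It is an illustration only, not the 3D-WAG implementation: the function names and the coarse-to-fine schedule `scales` are hypothetical, and it assumes cubic token maps in which every token of a scale is predicted in a single pass.

```python
from typing import List


def next_token_steps(resolution: int) -> int:
    """Sequential passes when each cell of a cubic token grid is predicted one at a time."""
    return resolution ** 3


def next_scale_steps(scale_resolutions: List[int]) -> int:
    """Sequential passes when each scale's whole token map is predicted in one pass."""
    return len(scale_resolutions)


def total_tokens(scale_resolutions: List[int]) -> int:
    """Total number of tokens produced across all scales of the hierarchy."""
    return sum(r ** 3 for r in scale_resolutions)


# Hypothetical coarse-to-fine schedule (an assumption, not taken from the paper).
scales = [1, 2, 4, 8, 16]

print("next-token passes:", next_token_steps(scales[-1]))
print("next-scale passes:", next_scale_steps(scales))
print("tokens across all scales:", total_tokens(scales))
```

Under these assumptions the multi-scale hierarchy holds slightly more tokens in total than the finest grid alone, yet the number of sequential passes falls from the number of grid cells to the number of scales. This is one plausible reading of where the reduction in generation cost could come from.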