Uni-3DAR: Unified 3D Generation and Understanding via Autoregression on Compressed Spatial Tokens

Recent advances in large language models and their multi-modal extensions have demonstrated the effectiveness of unifying generation and understanding through autoregressive next-token prediction. However, despite the critical role of 3D structural generation and understanding (3D GU) in AI for science, these tasks have largely evolved independently, and autoregressive methods remain underexplored. To bridge this gap, we introduce Uni-3DAR, a unified framework that integrates 3D GU tasks via autoregressive prediction. At its core, Uni-3DAR employs a novel hierarchical tokenization that compresses 3D space using an octree, leveraging the inherent sparsity of 3D structures. It then applies an additional tokenization for fine-grained structural details, capturing key attributes such as atom types and precise spatial coordinates in microscopic 3D structures. We further propose two optimizations to enhance efficiency and effectiveness. The first is a two-level subtree compression strategy, which reduces the octree token sequence by up to 8x. The second is a masked next-token prediction mechanism tailored for dynamically varying token positions, which significantly boosts model performance. By combining these strategies, Uni-3DAR unifies diverse 3D GU tasks within a single autoregressive framework. Extensive experiments across multiple microscopic 3D GU tasks, including molecules, proteins, polymers, and crystals, validate its effectiveness and versatility. Notably, Uni-3DAR surpasses previous state-of-the-art diffusion models by a substantial margin, achieving up to 256% relative improvement while delivering inference speeds up to 21.8x faster. The code is publicly available at https://github.com/dptech-corp/Uni-3DAR.
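To make the octree idea concrete, the following is a minimal sketch of sparse hierarchical tokenization, not the Uni-3DAR implementation. It assumes points lie in a cube [0, size)^3 and, at each level, emits one 8-bit occupancy mask per non-empty cell (bit i is set when child octant i contains at least one point). The function name `octree_tokens`, the bit layout, and the choice to pack eight sibling bits into one token are all illustrative assumptions; the abstract does not specify these details.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def octree_tokens(points: List[Point], depth: int, size: float = 1.0) -> List[List[int]]:
    """Return, for each octree level, the occupancy masks of its non-empty cells.

    Hypothetical helper for illustration only. Empty cells are never
    subdivided, so the token count follows the sparsity of the input.
    """
    levels: List[List[int]] = []
    # Each cell is (origin, points inside it); start from the root cell.
    cells = [((0.0, 0.0, 0.0), list(points))]
    edge = size
    for _ in range(depth):
        half = edge / 2
        tokens: List[int] = []
        next_cells = []
        for (ox, oy, oz), pts in cells:
            # Sort the cell's points into its eight child octants.
            buckets: List[List[Point]] = [[] for _ in range(8)]
            for x, y, z in pts:
                idx = (
                    (int(x >= ox + half) << 2)
                    | (int(y >= oy + half) << 1)
                    | int(z >= oz + half)
                )
                buckets[idx].append((x, y, z))
            mask = 0
            for idx, bucket in enumerate(buckets):
                if bucket:
                    mask |= 1 << idx
                    child_origin = (
                        ox + half * ((idx >> 2) & 1),
                        oy + half * ((idx >> 1) & 1),
                        oz + half * (idx & 1),
                    )
                    # Only non-empty children are expanded at the next level.
                    next_cells.append((child_origin, bucket))
            tokens.append(mask)
        levels.append(tokens)
        cells = next_cells
        edge = half
    return levels


example = octree_tokens([(0.1, 0.1, 0.1), (0.9, 0.9, 0.9)], depth=2)
print(example)  # [[129], [1, 128]]
```

In this example, two points at depth 2 yield three tokens in total, whereas a dense grid at the same resolution would have 4^3 = 64 cells. Emitting one token per child cell instead of one mask per parent would produce eight times as many tokens, which is one plausible reading of the "up to 8x" reduction mentioned above; this interpretation is an assumption rather than something the abstract states.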
