The rapid progress of large multimodal models has inspired efforts toward unified frameworks that couple understanding and generation. While such paradigms have shown remarkable success in 2D, extending them to 3D remains largely underexplored. Existing attempts to unify 3D tasks under a single autoregressive (AR) paradigm suffer significant performance degradation from forced signal quantization and incur prohibitive training cost. Our key insight is that the essential challenge lies not in enforcing a unified autoregressive paradigm, but in enabling effective information interaction between generation and understanding while minimally compromising their inherent capabilities and leveraging pretrained models to reduce training cost. Guided by this perspective, we present the first unified framework for 3D understanding and generation that combines autoregression with diffusion. Specifically, we adopt an autoregressive next-token prediction paradigm for 3D understanding and a continuous diffusion paradigm for 3D generation. A lightweight transformer bridges the feature space of large language models and the conditional space of 3D diffusion models, enabling effective cross-modal information exchange while preserving the priors learned by standalone models. Extensive experiments demonstrate that our framework achieves state-of-the-art performance across diverse 3D understanding and generation benchmarks, while also excelling in 3D editing tasks. These results highlight the potential of unified AR+diffusion models as a promising direction for building more general-purpose 3D intelligence.
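To make the bridging idea concrete, the following is a toy, dependency-free sketch of one way a small module could map language-model hidden states into a fixed number of conditioning tokens for a diffusion model. It is not the paper's architecture: the abstract does not specify the bridge's design, so the use of learnable queries with single-head cross-attention, the class name `ToyBridge`, and every dimension below are assumptions chosen purely for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned weights or features."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class ToyBridge:
    """Hypothetical single-head cross-attention bridge (illustrative only).

    A fixed set of query vectors attends over LLM hidden states and returns
    one conditioning token per query, in the diffusion model's condition width.
    """

    def __init__(self, llm_dim, cond_dim, num_queries, seed=0):
        rng = random.Random(seed)
        self.cond_dim = cond_dim
        self.queries = random_matrix(num_queries, cond_dim, rng)  # (Q, cond_dim)
        self.w_k = random_matrix(llm_dim, cond_dim, rng)          # key projection
        self.w_v = random_matrix(llm_dim, cond_dim, rng)          # value projection

    def __call__(self, llm_hidden):
        keys = matmul(llm_hidden, self.w_k)            # (T, cond_dim)
        values = matmul(llm_hidden, self.w_v)          # (T, cond_dim)
        keys_t = [list(col) for col in zip(*keys)]     # (cond_dim, T)
        scale = 1.0 / math.sqrt(self.cond_dim)
        scores = matmul(self.queries, keys_t)          # (Q, T)
        weights = [softmax([s * scale for s in row]) for row in scores]
        return matmul(weights, values)                 # (Q, cond_dim)


# Usage: 5 LLM tokens of width 8 become 3 conditioning tokens of width 4.
llm_hidden = random_matrix(5, 8, random.Random(1))
bridge = ToyBridge(llm_dim=8, cond_dim=4, num_queries=3)
cond_tokens = bridge(llm_hidden)
print(len(cond_tokens), len(cond_tokens[0]))  # prints: 3 4
```

In this sketch the language model and the diffusion model are both left untouched and only the bridge holds new parameters, which is one plausible reading of how pretrained priors could be preserved while training cost stays low. A practical bridge would likely use multiple heads and layers in a tensor library rather than plain lists.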