Large Language Models (LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either low-resolution meshes or coarse structural proxies, failing to capture fine-grained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of both 3D captioning and high-resolution 3D generation within a single framework. Leveraging a Mixture-of-Transformers architecture, CG-MLLM decouples disparate modeling needs: a Token-level Autoregressive (TokenAR) Transformer handles token-level content, while a Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM enables long-context interactions between standard tokens and spatial blocks within one integrated architecture. Experimental results show that CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm.