This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model. This offers two key advantages: (1) leveraging spatial knowledge already embedded in LLMs, derived from textual sources such as 3D tutorials, and (2) enabling conversational 3D generation and mesh understanding. A primary challenge is effectively tokenizing 3D mesh data into discrete tokens that LLMs can process seamlessly. To address this, we introduce LLaMA-Mesh, a novel approach that represents the vertex coordinates and face definitions of 3D meshes as plain text, allowing direct integration with LLMs without expanding the vocabulary. We construct a supervised fine-tuning (SFT) dataset enabling pretrained LLMs to (1) generate 3D meshes from text prompts, (2) produce interleaved text and 3D mesh outputs as required, and (3) understand and interpret 3D meshes. Our work is the first to demonstrate that LLMs can be fine-tuned to acquire complex spatial knowledge for 3D mesh generation in a text-based format, effectively unifying the 3D and text modalities. LLaMA-Mesh achieves mesh generation quality on par with models trained from scratch while maintaining strong text generation performance.
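The plain-text mesh representation described above can be sketched as follows. This is a minimal illustrative example, not the paper's actual preprocessing code: the function name, the OBJ-style `v`/`f` line format, and the choice of quantizing coordinates into 64 integer bins are assumptions for illustration.

```python
# Hypothetical sketch: serialize a 3D mesh as plain OBJ-style text so an
# LLM can read and write it with its existing tokenizer, no new vocabulary.

def mesh_to_text(vertices, faces, bins=64):
    """Quantize vertex coordinates in [-1, 1] to integers and emit
    OBJ-style lines ("v x y z" for vertices, "f a b c" for faces)."""
    lines = []
    for x, y, z in vertices:
        # Quantize each coordinate into [0, bins) so the serialized mesh
        # stays compact and uses only common small-integer tokens.
        qx, qy, qz = (min(int((c + 1) / 2 * bins), bins - 1) for c in (x, y, z))
        lines.append(f"v {qx} {qy} {qz}")
    for a, b, c in faces:
        # Faces reference vertices by 1-based index, as in the OBJ format.
        lines.append(f"f {a} {b} {c}")
    return "\n".join(lines)

# A single triangle serialized into LLM-readable text:
tri = mesh_to_text(
    [(-1.0, -1.0, 0.0), (1.0, -1.0, 0.0), (0.0, 1.0, 0.0)],
    [(1, 2, 3)],
)
print(tri)
```

Because the result is ordinary text, it can be interleaved with natural language in SFT training pairs, which is what allows a single model to both chat about and emit meshes.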