Large language models, which gain flexibility from instruction-driven approaches, have revolutionized many traditional generative tasks, but large models for 3D data, particularly ones that handle 3D shapes jointly with other modalities, remain under-explored. By enabling instruction-based shape generation, versatile multimodal generative shape models can significantly benefit fields such as 3D virtual construction and network-aided design. In this work, we present ShapeGPT, a multimodal framework that incorporates shapes and leverages strong pre-trained language models to address multiple shape-relevant tasks. Specifically, ShapeGPT employs a word-sentence-paragraph framework: it discretizes continuous shapes into shape words, assembles these words into shape sentences, and integrates shapes with instructional text to form multimodal paragraphs. To learn this shape-language model, we use a three-stage training scheme, comprising shape representation, multimodal alignment, and instruction-based generation, to align the shape and language codebooks and learn the intricate correlations among these modalities. Extensive experiments demonstrate that ShapeGPT achieves comparable performance across shape-relevant tasks, including text-to-shape, shape-to-text, shape completion, and shape editing.
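The word-sentence-paragraph idea can be illustrated with a minimal sketch. This is not the ShapeGPT implementation. It assumes a nearest-neighbor codebook lookup as the discretization step, and the codebook values, the `<shape_k>` token format, the `<shape_start>`/`<shape_end>` markers, and all function names are hypothetical choices made only for illustration.

```python
# Illustrative sketch only: every name, token format, and value below is a
# hypothetical assumption, not taken from the ShapeGPT paper or code.

def quantize(features, codebook):
    """Map each continuous feature vector to the index of its nearest
    codebook entry (squared Euclidean distance). Each index is a 'shape word'."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]

def shape_sentence(indices):
    """Assemble shape words into a 'shape sentence' of discrete tokens."""
    return " ".join(f"<shape_{i}>" for i in indices)

def paragraph(instruction, sentence):
    """Combine instruction text and a shape sentence into one
    multimodal 'paragraph' that a language model could consume."""
    return f"{instruction} <shape_start> {sentence} <shape_end>"

# Toy 2D codebook with four entries (real shape codebooks would be learned).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy continuous features standing in for an encoded shape.
features = [[0.1, -0.1], [0.9, 0.2], [0.8, 0.9]]

words = quantize(features, codebook)        # [0, 1, 3]
sentence = shape_sentence(words)            # "<shape_0> <shape_1> <shape_3>"
prompt = paragraph("Complete this shape:", sentence)
print(prompt)
```

In this toy setting the shape becomes a short sequence of discrete tokens that sits alongside ordinary text, which is what lets a pre-trained language model treat shape-relevant tasks as sequence-to-sequence problems.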