Large language models, with the flexibility afforded by instruction-driven approaches, have revolutionized many traditional generative tasks, but large models for 3D data, particularly ones that comprehensively handle 3D shapes together with other modalities, remain under-explored. By enabling instruction-based shape generation, versatile multimodal generative shape models can significantly benefit fields such as 3D virtual construction and network-aided design. In this work, we present ShapeGPT, a shape-inclusive multimodal framework that leverages strong pre-trained language models to address multiple shape-relevant tasks. Specifically, ShapeGPT employs a word-sentence-paragraph framework: it discretizes continuous shapes into shape words, assembles these words into shape sentences, and integrates shapes with instructional text to form multimodal paragraphs. To learn this shape-language model, we use a three-stage training scheme, comprising shape representation, multimodal alignment, and instruction-based generation, to align shape-language codebooks and learn the intricate correlations among these modalities. Extensive experiments demonstrate that ShapeGPT achieves comparable performance across shape-relevant tasks, including text-to-shape, shape-to-text, shape completion, and shape editing.
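The word-sentence-paragraph idea can be illustrated with a minimal sketch. It assumes that shape words are obtained by nearest-neighbour lookup in a codebook, a common vector-quantization approach; the abstract does not specify the tokenizer. The function names, the `<shape_i>` token format, the `<shape_start>`/`<shape_end>` delimiters, and the toy codebook are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def nearest_code(feature: Sequence[float],
                 codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `feature`
    under squared Euclidean distance."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))


def shape_to_words(features: Sequence[Sequence[float]],
                   codebook: Sequence[Sequence[float]]) -> List[int]:
    """Discretize continuous shape features into 'shape words' (code indices)."""
    return [nearest_code(f, codebook) for f in features]


def words_to_sentence(words: Sequence[int]) -> str:
    """Assemble shape words into a 'shape sentence' of hypothetical tokens."""
    return " ".join(f"<shape_{w}>" for w in words)


def build_paragraph(instruction: str, sentence: str) -> str:
    """Combine instruction text and a shape sentence into one
    multimodal 'paragraph' (delimiter tokens are an assumption)."""
    return f"{instruction} <shape_start> {sentence} <shape_end>"


# Toy 2-D codebook and features; a real shape encoder would produce
# higher-dimensional features and a learned codebook.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = [[0.1, -0.1], [0.9, 0.2], [0.8, 0.9]]

words = shape_to_words(features, codebook)
sentence = words_to_sentence(words)
paragraph = build_paragraph("Describe this shape:", sentence)
print(paragraph)
```

In a setup like this, the resulting paragraph string could be passed to a language model whose vocabulary has been extended with the shape tokens, so that text and shape share one token sequence.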