Large 2D vision-language models (2D-LLMs) have gained significant attention by bridging Large Language Models (LLMs) with images using a simple projector. Inspired by their success, large 3D point cloud-language models (3D-LLMs) also integrate point clouds into LLMs. However, directly aligning point clouds with LLMs incurs high training costs, typically hundreds of GPU-hours on A100 GPUs, which hinders the development of 3D-LLMs. In this paper, we introduce MiniGPT-3D, an efficient and powerful 3D-LLM that achieves multiple SOTA results while training for only 27 hours on one RTX 3090. Specifically, we propose to align 3D point clouds with LLMs using 2D priors from 2D-LLMs, which leverages the similarity between 2D and 3D visual information. We introduce a novel four-stage training strategy that aligns the modalities in a cascaded way, and a mixture of query experts module that adaptively aggregates features with high efficiency. Moreover, we use the parameter-efficient fine-tuning methods LoRA and Norm fine-tuning, resulting in only 47.8M learnable parameters, up to 260x fewer than existing methods. Extensive experiments show that MiniGPT-3D achieves SOTA on 3D object classification and captioning tasks at a significantly lower training cost. Notably, on the challenging object captioning task, MiniGPT-3D improves the GPT-4 evaluation score by 8.12 over ShapeLLM-13B, while the latter costs 160 total GPU-hours on 8 A800 GPUs. We are the first to explore efficient 3D-LLMs, offering new insights to the community. Code and weights are available at https://github.com/TangYuan96/MiniGPT-3D.
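To illustrate why LoRA keeps the learnable parameter count small, the following is a minimal sketch of a LoRA-adapted linear layer in plain Python. It is not the MiniGPT-3D implementation: the class name `LoRALinear`, the helper `matmul`, and the layer size, rank, and scaling value are all hypothetical choices made for illustration. The sketch only shows the general idea that a frozen weight W is augmented with a trainable low-rank product B·A, so the number of trainable values grows with rank × (d_in + d_out) rather than d_in × d_out.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A.

    Illustrative only; sizes and initialisation are assumptions, not the paper's.
    """

    def __init__(self, d_in, d_out, rank, alpha, seed=0):
        rng = random.Random(seed)
        # Frozen "pretrained" weight (random here, purely for illustration).
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable factors: A is small and random, B starts at zero so the
        # adapted layer initially behaves exactly like the frozen layer.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def trainable_params(self):
        """Number of values in A and B, the only parts that would be updated."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        """Number of values in the frozen weight W."""
        return len(self.W) * len(self.W[0])

    def forward(self, x):
        """Compute W x + scale * B (A x) for a list x of d_in floats."""
        col = [[v] for v in x]
        base = matmul(self.W, col)
        delta = matmul(self.B, matmul(self.A, col))
        return [b[0] + self.scale * d[0] for b, d in zip(base, delta)]


# Hypothetical sizes: a 64x64 layer adapted with rank 4.
layer = LoRALinear(d_in=64, d_out=64, rank=4, alpha=8)
print(layer.trainable_params(), "trainable vs", layer.frozen_params(), "frozen")
```

With these assumed sizes, the adapter trains 4 × (64 + 64) values while the 64 × 64 base weight stays fixed. The 47.8M figure reported above covers all learnable components of MiniGPT-3D, not just one such layer.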