MiniGPT-3D: Efficiently Aligning 3D Point Clouds with Large Language Models using 2D Priors

Large 2D vision-language models (2D-LLMs) have gained significant attention by bridging Large Language Models (LLMs) with images using a simple projector. Inspired by their success, large 3D point cloud-language models (3D-LLMs) likewise integrate point clouds into LLMs. However, directly aligning point clouds with LLMs incurs high training costs, typically hundreds of GPU-hours on A100 GPUs, which hinders the development of 3D-LLMs. In this paper, we introduce MiniGPT-3D, an efficient and powerful 3D-LLM that achieves multiple SOTA results while training for only 27 hours on one RTX 3090. Specifically, we propose to align 3D point clouds with LLMs using 2D priors from 2D-LLMs, which leverages the similarity between 2D and 3D visual information. We introduce a novel four-stage training strategy that performs modality alignment in a cascaded way, and a mixture of query experts module that adaptively aggregates features with high efficiency. Moreover, we use the parameter-efficient fine-tuning methods LoRA and Norm fine-tuning, resulting in only 47.8M learnable parameters, up to 260x fewer than existing methods. Extensive experiments show that MiniGPT-3D achieves SOTA on 3D object classification and captioning tasks at significantly lower training cost. Notably, MiniGPT-3D improves the GPT-4 evaluation score on the challenging object captioning task by 8.12 compared to ShapeLLM-13B, while the latter costs 160 total GPU-hours on 8 A800 GPUs. We are the first to explore efficient 3D-LLMs, offering new insights to the community. Code and weights are available at https://github.com/TangYuan96/MiniGPT-3D.
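The cost figures in the abstract can be put side by side with a little arithmetic. The sketch below uses only numbers quoted above; the variable and function names are illustrative, and the GPU-hour ratio ignores the hardware difference between an RTX 3090 and an A800, so it should be read as a rough comparison rather than a benchmark.

```python
# Figures quoted in the abstract; names are illustrative, not from the paper's code.
MINIGPT3D_TRAINABLE_PARAMS = 47.8e6  # learnable parameters
MAX_REDUCTION_FACTOR = 260           # "up to 260x fewer than existing methods"
MINIGPT3D_GPU_HOURS = 27 * 1         # 27 hours on one RTX 3090
SHAPELLM_13B_GPU_HOURS = 160         # total GPU-hours on 8 A800 GPUs


def implied_baseline_params(trainable: float, factor: float) -> float:
    """Learnable-parameter count implied for the largest baseline."""
    return trainable * factor


def gpu_hour_ratio(baseline_hours: float, own_hours: float) -> float:
    """How many times more GPU-hours the baseline used (hardware-agnostic)."""
    return baseline_hours / own_hours


baseline_params = implied_baseline_params(
    MINIGPT3D_TRAINABLE_PARAMS, MAX_REDUCTION_FACTOR
)
hours_ratio = gpu_hour_ratio(SHAPELLM_13B_GPU_HOURS, MINIGPT3D_GPU_HOURS)

print(f"Implied baseline learnable parameters: {baseline_params / 1e9:.1f}B")
print(f"GPU-hour ratio vs. ShapeLLM-13B: {hours_ratio:.1f}x")
```

The implied baseline of roughly 12.4B learnable parameters is consistent with fully training a model on the scale of 13B parameters, which is presumably where the "up to 260x" figure comes from.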


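Point cloud

A point cloud obtained by laser measurement carries 3D coordinates (XYZ) and laser reflection intensity (Intensity). A point cloud obtained by photogrammetry carries 3D coordinates (XYZ) and color information (RGB). A point cloud obtained by combining laser measurement and photogrammetry carries 3D coordinates (XYZ), laser reflection intensity (Intensity), and color information (RGB). Once the spatial coordinates of every sampled point on an object's surface have been acquired, the result is a set of points, called a "point cloud".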
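The three kinds of point cloud described above differ only in which per-point attributes are present. A minimal sketch of that idea follows; the `Point` class and `source_kind` helper are hypothetical names chosen for illustration and are not part of MiniGPT-3D or any point cloud library.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Point:
    """One sampled surface point; optional fields depend on the sensor."""
    xyz: Tuple[float, float, float]
    intensity: Optional[float] = None           # from laser measurement
    rgb: Optional[Tuple[int, int, int]] = None  # from photogrammetry


def source_kind(cloud: List[Point]) -> str:
    """Classify a point cloud by the attributes every point carries."""
    if not cloud:
        return "empty"
    has_intensity = all(p.intensity is not None for p in cloud)
    has_rgb = all(p.rgb is not None for p in cloud)
    if has_intensity and has_rgb:
        return "laser + photogrammetry"
    if has_intensity:
        return "laser"
    if has_rgb:
        return "photogrammetry"
    return "xyz only"


laser_cloud = [Point((0.0, 0.0, 1.0), intensity=0.8)]
photo_cloud = [Point((0.0, 0.0, 1.0), rgb=(255, 128, 0))]
fused_cloud = [Point((0.0, 0.0, 1.0), intensity=0.8, rgb=(255, 128, 0))]

for cloud in (laser_cloud, photo_cloud, fused_cloud):
    print(source_kind(cloud))
```

In practice point clouds are usually stored as dense arrays with one row per point; the per-point class here is only meant to make the attribute layout explicit.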
