The potential of multi-modal large language models (MLLMs) to comprehend both visual and language information is widely acknowledged. However, 3D scene-language pairs are scarce compared with their 2D counterparts, and existing approaches fall short in enabling LLMs to understand 3D scenes; together these pose a significant challenge. In response, we collect and construct an extensive dataset of 75K instruction-response pairs tailored to 3D scenes, covering 3D VQA, 3D grounding, and 3D conversation. To further enhance the integration of 3D spatial information into LLMs, we introduce 3DMIT, a novel and efficient prompt tuning paradigm. It eliminates the alignment stage between 3D scenes and language and instead extends the instruction prompt with 3D modality information, including both the entire scene and the segmented objects. We evaluate our method on diverse tasks in the 3D scene domain and find that it offers an effective way to enrich LLMs' comprehension of the 3D world. Our code is available at https://github.com/staymylove/3DMIT.
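As a rough illustration of what extending the instruction prompt with 3D modality information could look like, the sketch below builds a prompt containing one placeholder for the entire scene and one placeholder per segmented object. The placeholder tokens, the `Scene3D` container, the `build_prompt` helper, and the prompt template are all hypothetical and are not taken from the 3DMIT code; in a real system the placeholders would presumably be replaced by scene and object feature embeddings before the prompt reaches the LLM.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical placeholder token; the actual 3DMIT prompt format may differ.
SCENE_TOKEN = "<scene>"


@dataclass
class Scene3D:
    """Illustrative container for pre-extracted 3D features (an assumption)."""
    scene_feature: List[float]          # one global feature for the entire scene
    object_features: List[List[float]]  # one feature per segmented object


def build_prompt(scene: Scene3D, instruction: str) -> str:
    """Prefix the instruction with one scene slot and one slot per object."""
    object_slots = " ".join(
        f"<obj_{i}>" for i in range(len(scene.object_features))
    )
    return (
        f"Scene: {SCENE_TOKEN} Objects: {object_slots}\n"
        f"Instruction: {instruction}\n"
        f"Response:"
    )


if __name__ == "__main__":
    demo = Scene3D(scene_feature=[0.1], object_features=[[0.2], [0.3]])
    print(build_prompt(demo, "What is on the table?"))
```

The point of the sketch is only that the 3D information enters through the prompt itself, with no separate scene-language alignment stage; how the features are extracted and projected is left out.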