Prior work on 3D scene understanding has mostly either built specialized models for individual tasks or relied on task-specific fine-tuning. We propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases that reference 3D scenes, which lets it handle sequences that interleave 3D and textual data. Combined with task-specific instruction templates, this representation offers a natural way to translate 3D vision tasks into language formats. To support the use of referent tokens in subsequent language modeling, we curate large-scale grounded language datasets that provide finer, phrase-level scene-text correspondence by bootstrapping existing object labels. We then introduce Contrastive LAnguage-Scene Pre-training (CLASP) to leverage this data effectively and thereby integrate 3D vision with language models. Our evaluation covers open-ended tasks such as dense captioning and 3D QA, alongside closed-ended tasks such as object detection and language grounding. Experiments across multiple 3D benchmarks show the leading performance and broad applicability of Grounded 3D-LLM. Code and datasets will be released on the project page: https://groundedscenellm.github.io/grounded_3d-llm.github.io.
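To make the idea of referent tokens and instruction templates concrete, the following is a minimal sketch and not the released implementation. It assumes a placeholder token written as `<ref>`, three illustrative templates, and integer object IDs; the token string, the template wording, and the helper names `build_prompt` and `ground_response` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical placeholder; the model's actual token vocabulary may differ.
REF_TOKEN = "<ref>"

# Hypothetical task-specific instruction templates (illustrative wording only).
TEMPLATES = {
    "grounding": "Locate the object described as: {query}",
    "detection": "List every {category} in the scene.",
    "captioning": "Describe the object {ref} in detail.",
}


@dataclass
class GroundedText:
    text: str        # text containing one REF_TOKEN per grounded phrase
    referents: list  # referents[i] holds the object IDs for the i-th REF_TOKEN


def build_prompt(task, **fields):
    """Fill the template for `task`; `{ref}` expands to the referent token."""
    return TEMPLATES[task].format(ref=REF_TOKEN, **fields)


def ground_response(phrases):
    """Interleave text with referent tokens.

    `phrases` is a list of (text, object_ids) pairs; object_ids is None
    for spans that do not refer to any scene object.
    """
    parts, referents = [], []
    for text, object_ids in phrases:
        parts.append(text)
        if object_ids is not None:
            parts.append(REF_TOKEN)
            referents.append(list(object_ids))
    return GroundedText(" ".join(parts), referents)


# A grounded answer: each noun phrase is followed by a token that stands
# for one or more objects in the scene (IDs here are made up).
answer = ground_response([
    ("The two chairs", [3, 7]),
    ("are next to", None),
    ("the table", [12]),
])
caption_prompt = build_prompt("captioning")
grounding_prompt = build_prompt("grounding", query="the red chair")
```

In this sketch a single token can stand for several objects (the two chairs share one `<ref>`), so detection-style and grounding-style outputs fit the same text format. In a trained model the mapping from a referent token to objects would come from learned embeddings rather than from an explicit ID list.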