Recent advances in 3D Large Language Models (LLMs) have demonstrated promising capabilities for 3D scene understanding. However, previous methods lack the general referencing and grounding capabilities required for intricate scene comprehension. In this paper, we introduce object identifiers and object-centric representations to interact with scenes at the object level. Specifically, we decompose the input 3D scene into a set of object proposals, each assigned a unique identifier token, which enables efficient object referencing and grounding during user-assistant interactions. Given the scarcity of scene-language data, we model the scene embedding as a sequence of explicit object-level embeddings derived from semantic-rich 2D or 3D representations. By employing object identifiers, we transform diverse 3D scene-language tasks into a unified question-answering format, enabling joint training without additional task-specific heads. With minimal fine-tuning on all downstream tasks, our model significantly outperforms existing methods on benchmarks including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
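To make the object-identifier scheme concrete, the following is a minimal sketch of how a scene might be decomposed into identifier-tagged proposals and how a grounding query could be cast into the unified question-answering format. All names (`ObjectProposal`, `assign_identifiers`, the `<OBJxxx>` token style, and the prompt wording) are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class ObjectProposal:
    # Object-centric feature vector, e.g., produced by a 2D or 3D encoder
    # (the paper derives these from semantic-rich 2D/3D representations).
    embedding: list

def assign_identifiers(proposals):
    """Assign a unique identifier token to each object proposal.

    The token format <OBJ001>, <OBJ002>, ... is an assumed convention;
    the key idea is only that each proposal gets a distinct token that
    the LLM can emit or consume during user-assistant interactions.
    """
    return {f"<OBJ{idx:03d}>": p for idx, p in enumerate(proposals, start=1)}

def build_scene_prefix(id_to_proposal):
    """Model the scene as a sequence of (identifier token, object embedding)
    pairs, which would be interleaved into the LLM's input sequence."""
    return [(token, prop.embedding) for token, prop in id_to_proposal.items()]

def grounding_as_qa(description):
    """Cast a referring/grounding task into the unified QA format: the
    question asks for an object, and the answer is an identifier token.
    The exact prompt wording here is hypothetical."""
    return (f'Which object matches the description "{description}"? '
            f"Answer with its identifier token.")

# Usage: two toy proposals stand in for detector output on a real scan.
proposals = [ObjectProposal([0.1, 0.2]), ObjectProposal([0.3, 0.4])]
id_map = assign_identifiers(proposals)
scene_prefix = build_scene_prefix(id_map)
question = grounding_as_qa("the chair near the window")
```

Because every task's answer space includes the identifier tokens, grounding, referring, captioning, and QA can share one output head: the model simply generates text that may contain `<OBJxxx>` tokens, which are then mapped back to 3D proposals.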