Integrating the 3D world into large language models (3D-based LLMs) has been a promising research direction for 3D scene understanding. However, current 3D-based LLMs fall short in situated understanding due to two key limitations: (1) existing 3D datasets are constructed from a global perspective of the 3D scene and lack situated context, and (2) the architectures of existing 3D-based LLMs lack explicit alignment between the spatial representations of 3D scenes and natural language, limiting their performance on tasks requiring precise spatial reasoning. We address these issues by introducing a scalable situated 3D dataset, named Spartun3D, that incorporates various situated spatial reasoning tasks. Furthermore, we propose Spartun3D-LLM, built on an existing 3D-based LLM but integrated with a novel situated spatial alignment module, aiming to enhance the alignment between 3D visual representations and their corresponding textual descriptions. Experimental results demonstrate that both our proposed dataset and alignment module significantly enhance the situated spatial understanding of 3D-based LLMs.