Being able to carry out complicated vision language reasoning tasks in 3D space represents a significant milestone in developing household robots and human-centered embodied AI. In this work, we demonstrate that a critical and distinct challenge in 3D vision language reasoning is situational awareness, which incorporates two key components: (1) The autonomous agent grounds its self-location based on a language prompt. (2) The agent answers open-ended questions from the perspective of its estimated position. To address this challenge, we introduce SIG3D, an end-to-end Situation-Grounded model for 3D vision language reasoning. We tokenize the 3D scene into a sparse voxel representation and propose a language-grounded situation estimator, followed by a situated question answering module. Experiments on the SQA3D and ScanQA datasets show that SIG3D outperforms state-of-the-art models in situation estimation and question answering by a large margin (e.g., an improvement of over 30% in situation estimation accuracy). Subsequent analysis corroborates our architectural design choices, explores the distinct functions of visual and textual tokens, and highlights the importance of situational awareness in the domain of 3D question answering.