We propose a new task to benchmark the scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., a 3D scan), SQA3D requires the tested agent to first understand its situation (position, orientation, etc.) in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation. Based on 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations. These questions examine a wide spectrum of reasoning capabilities for an intelligent agent, ranging from spatial relation comprehension to commonsense understanding, navigation, and multi-hop reasoning. SQA3D poses a significant challenge to current multi-modal reasoning models, especially 3D ones. We evaluate various state-of-the-art approaches and find that the best one achieves an overall score of only 47.20%, while amateur human participants reach 90.06%. We believe SQA3D could facilitate future embodied AI research with stronger situation understanding and reasoning capabilities.
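To make the task format concrete, the following is a minimal Python sketch of what one situated question might look like as a record, together with a simple scoring function. It is an illustration under stated assumptions, not the dataset's actual schema or the official evaluation code: the class names, the field names (`position`, `yaw_radians`, `scene_id`), the example values, and the use of case-insensitive exact match as the score are all hypothetical, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Situation:
    """Agent pose in a scene plus the text that describes it (hypothetical schema)."""
    description: str                      # free-text description of where the agent is
    position: Tuple[float, float, float]  # assumed (x, y, z) in scene coordinates
    yaw_radians: float                    # assumed heading about the vertical axis


@dataclass(frozen=True)
class SituatedQuestion:
    """One question to be answered under a given situation (hypothetical schema)."""
    scene_id: str          # illustrative scene identifier
    situation: Situation
    question: str
    answer: str            # reference answer


def exact_match_accuracy(predictions: Sequence[str],
                         samples: Sequence[SituatedQuestion]) -> float:
    """Percentage of predictions equal to the reference answer.

    Comparison ignores case and surrounding whitespace. Using exact match
    as the score is an assumption made for this sketch.
    """
    if len(predictions) != len(samples):
        raise ValueError("predictions and samples must have the same length")
    if not samples:
        return 0.0
    hits = sum(
        pred.strip().lower() == sample.answer.strip().lower()
        for pred, sample in zip(predictions, samples)
    )
    return 100.0 * hits / len(samples)


# Illustrative, made-up example: one situation with two questions.
facing_window = Situation(
    description="I am standing by the desk and facing the window.",
    position=(1.2, 0.4, 0.0),
    yaw_radians=1.57,
)
samples = [
    SituatedQuestion("scene0000_00", facing_window,
                     "What is on my left?", "bed"),
    SituatedQuestion("scene0000_00", facing_window,
                     "Can I reach the door without turning around?", "no"),
]
score = exact_match_accuracy(["Bed", "yes"], samples)
```

The sketch keeps the situation separate from the question because the dataset statistics above pair each situation with several descriptions and questions; the same `Situation` object can then be shared across multiple `SituatedQuestion` records.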