We consider the problem of Embodied Question Answering (EQA), which refers to settings where an embodied agent such as a robot needs to actively explore an environment to gather information until it is confident about the answer to a question. In this work, we leverage the strong semantic reasoning capabilities of large vision-language models (VLMs) to efficiently explore and answer such questions. However, there are two main challenges when using VLMs in EQA: they have no internal memory for mapping the scene, which is needed to plan exploration over time, and their confidence can be miscalibrated, causing the robot to prematurely stop exploring or to over-explore. We propose a method that first builds a semantic map of the scene from depth information and visual prompting of a VLM, leveraging the model's vast knowledge of which regions of the scene are relevant for exploration. Next, we use conformal prediction to calibrate the VLM's question-answering confidence, allowing the robot to know when to stop exploring, which leads to a more calibrated and efficient exploration strategy. To test our framework in simulation, we also contribute a new EQA dataset with diverse, realistic human-robot scenarios and scenes built upon the Habitat-Matterport 3D Research Dataset (HM3D). Both simulated and real-robot experiments show that our approach improves performance and efficiency over baselines that do not leverage VLMs for exploration or do not calibrate their confidence. Webpage with experiment videos and code: https://explore-eqa.github.io/
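To illustrate the conformal-prediction stopping rule described above, here is a minimal sketch: calibrate a nonconformity threshold from held-out questions, then at test time keep only answers whose VLM confidence clears that threshold and stop exploring once a single answer remains. All function names and the calibration data are hypothetical; this is a generic split conformal prediction recipe, not the paper's implementation.

```python
import numpy as np

def calibrate_threshold(true_answer_confidences, alpha=0.2):
    """Compute the conformal quantile from calibration data.

    true_answer_confidences: VLM confidence assigned to the correct
    answer on each calibration question (hypothetical values here).
    alpha: target miscoverage rate (1 - alpha coverage guarantee).
    """
    n = len(true_answer_confidences)
    # Nonconformity score: 1 - confidence in the true answer.
    scores = 1.0 - np.asarray(true_answer_confidences)
    # Finite-sample-adjusted (1 - alpha) empirical quantile.
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(q, 1.0), method="higher")

def prediction_set(answer_confidences, threshold):
    """Answers whose nonconformity score stays within the threshold."""
    return {a for a, c in answer_confidences.items() if 1.0 - c <= threshold}

# Hypothetical calibration confidences for the true answers.
calib = [0.9, 0.8, 0.85, 0.95, 0.7, 0.9, 0.88, 0.92, 0.6, 0.97]
tau = calibrate_threshold(calib, alpha=0.2)

# At a test step, the VLM's confidence over multiple-choice answers:
confs = {"A": 0.9, "B": 0.1, "C": 0.05, "D": 0.02}
pset = prediction_set(confs, tau)
stop = len(pset) == 1  # stop exploring once one answer remains
```

A singleton prediction set means the calibrated model is confident enough to answer; a larger set signals the robot should keep exploring, which is how calibration prevents both premature stopping and over-exploration.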