We consider the problem of Embodied Question Answering (EQA), in which an embodied agent such as a robot must actively explore an environment to gather information until it is confident about the answer to a question. In this work, we leverage the strong semantic reasoning capabilities of large vision-language models (VLMs) to explore efficiently and answer such questions. However, there are two main challenges when using VLMs in EQA: they have no internal memory for mapping the scene and so cannot plan how to explore over time, and their confidence can be miscalibrated, which can cause the robot to stop exploring prematurely or to over-explore. We propose a method that first builds a semantic map of the scene from depth information and visual prompting of a VLM, leveraging the VLM's vast knowledge of which regions of the scene are relevant for exploration. Next, we use conformal prediction to calibrate the VLM's question-answering confidence, allowing the robot to know when to stop exploring and leading to a better-calibrated and more efficient exploration strategy. To test our framework in simulation, we also contribute a new EQA dataset with diverse, realistic human-robot scenarios and scenes built upon the Habitat-Matterport 3D Research Dataset (HM3D). Both simulated and real-robot experiments show that our approach improves performance and efficiency over baselines that do not leverage VLMs for exploration or do not calibrate their confidence. Webpage with experiment videos and code: https://explore-eqa.github.io/
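To make the stopping rule concrete, the following is a minimal sketch of standard split conformal prediction applied to a multiple-choice question. It is an illustration under stated assumptions, not the implementation used in this work: the abstract does not specify the nonconformity score, the calibration data, or the error level, so the score (one minus the VLM's probability for an option), the function names, and all numbers below are hypothetical.

```python
import math


def conformal_threshold(true_label_probs, epsilon):
    """Compute the score threshold q_hat from held-out calibration data.

    true_label_probs: probability the model gave to the correct answer on
        each calibration question (hypothetical data).
    epsilon: target error rate; the prediction set should contain the
        correct answer with probability at least 1 - epsilon.
    """
    n = len(true_label_probs)
    # Assumed nonconformity score: 1 - probability of the true answer.
    scores = sorted(1.0 - p for p in true_label_probs)
    # Finite-sample corrected rank used in split conformal prediction.
    rank = math.ceil((n + 1) * (1.0 - epsilon))
    if rank > n:
        # Too few calibration points for this epsilon: keep every option.
        return float("inf")
    return scores[rank - 1]


def prediction_set(option_probs, q_hat):
    """Return all answer options whose score does not exceed q_hat."""
    return [opt for opt, p in option_probs.items() if 1.0 - p <= q_hat]


def should_stop(option_probs, q_hat):
    """Stop exploring once exactly one answer option remains plausible."""
    return len(prediction_set(option_probs, q_hat)) == 1


# Hypothetical calibration probabilities and error level.
calibration = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.25, 0.1]
q_hat = conformal_threshold(calibration, epsilon=0.25)

# Hypothetical VLM outputs at two points during exploration.
early_view = {"A": 0.4, "B": 0.35, "C": 0.15, "D": 0.1}
later_view = {"A": 0.85, "B": 0.05, "C": 0.05, "D": 0.05}

print(prediction_set(early_view, q_hat), should_stop(early_view, q_hat))
print(prediction_set(later_view, q_hat), should_stop(later_view, q_hat))
```

In this sketch the robot keeps exploring while several options remain in the prediction set and stops once the set shrinks to a single answer. Treating a singleton set as the stopping condition is one natural reading of "knowing when to stop"; other criteria are possible.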