Embodied Question Answering (EQA) is an essential yet challenging task for robotic home assistants. Recent studies have shown that large vision-language models (VLMs) can be effectively utilized for EQA, but existing works either focus on video-based question answering without embodied exploration or rely on closed-form choice sets. In real-world scenarios, a robotic agent must efficiently explore and accurately answer questions in open-vocabulary settings. To address these challenges, we propose a novel framework called EfficientEQA for open-vocabulary EQA, which enables efficient exploration and accurate answering. In EfficientEQA, the robot actively explores unknown environments using Semantic-Value-Weighted Frontier Exploration, a strategy that prioritizes frontiers by semantic importance, derived from calibrated confidence of black-box VLMs, so that relevant information is gathered quickly. To generate accurate answers, we employ Retrieval-Augmented Generation (RAG), which uses BLIP to retrieve useful images from the accumulated observations and VLM reasoning to produce responses without relying on predefined answer choices. Additionally, we detect observations that are highly relevant to the question as outliers, allowing the robot to determine when it has gathered sufficient information to stop exploring and provide an answer. Experimental results demonstrate the effectiveness of our approach, improving answering accuracy by over 15% and efficiency, measured in running steps, by over 20% compared to state-of-the-art methods.
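The three mechanisms named above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `semantic_values` stands in for the calibrated VLM confidence scores, the z-score stopping rule is a hypothetical instantiation of the outlier criterion, and `retrieve_top_k` mimics BLIP image retrieval with plain cosine similarity over precomputed embeddings.

```python
import numpy as np

def select_frontier(frontiers, semantic_values):
    """Semantic-value-weighted frontier selection: explore the frontier
    with the highest semantic score (scores assumed given by a VLM)."""
    return frontiers[int(np.argmax(semantic_values))]

def is_outlier(score, history, z_thresh=2.5):
    """Hypothetical stopping rule: an observation whose relevance score
    is a statistical outlier w.r.t. scores seen so far signals that the
    robot has enough information to answer."""
    if len(history) < 2:
        return False
    mu, sigma = np.mean(history), np.std(history)
    if sigma == 0:
        return False
    return (score - mu) / sigma > z_thresh

def retrieve_top_k(query_emb, image_embs, k=3):
    """RAG retrieval step: rank stored observation embeddings by cosine
    similarity to the question embedding (standing in for BLIP matching)."""
    q = query_emb / np.linalg.norm(query_emb)
    embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return np.argsort(embs @ q)[::-1][:k]
```

The retrieved top-k images would then be passed to the VLM, which reasons over them to produce a free-form, open-vocabulary answer.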