In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment in order to answer a situated question with confidence. This remains a challenging problem in robotics due to the difficulties in obtaining useful semantic representations, updating these representations online, and leveraging prior world knowledge for efficient exploration and planning. Aiming to address these limitations, we propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task-relevant images as multi-modal memory for grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical structure of 3DSGs for structured planning and semantics-guided exploration. Through experiments in simulation on the HM-EQA dataset and in the real world in home and office environments, we demonstrate that our method outperforms key baselines, completing EQA tasks with higher success rates and fewer planning steps.