Multi-Step Reasoning for Embodied Question Answering via Tool Augmentation

Embodied Question Answering (EQA) requires an agent to explore a 3D environment, gather observations, and answer questions about the scene. Existing methods use vision-language models (VLMs) to explore the environment and answer questions directly, without explicit thinking or planning. This limits their reasoning ability and leads to excessive or inefficient exploration and ineffective responses. In this paper, we introduce ToolEQA, an agent that integrates external tools with multi-step reasoning. The tools supply additional information for the task, which helps the model choose a better exploration direction at the next reasoning step and thereby gather further useful information. As a result, ToolEQA generates more accurate responses with a shorter exploration distance. To strengthen the model's tool use and multi-step reasoning, we further design a novel EQA data generation pipeline that automatically constructs large-scale EQA tasks with reasoning trajectories and corresponding answers. Using this pipeline, we collect the EQA-RT dataset, which contains about 18K tasks divided into a training set, EQA-RT-Train, and two test sets, EQA-RT-Seen (scenes overlapping with the training set) and EQA-RT-Unseen (novel scenes). Experiments on EQA-RT-Seen and EQA-RT-Unseen show that ToolEQA improves the success rate by 9.2% to 20.2% over state-of-the-art baselines and outperforms zero-shot ToolEQA by 10% in success rate. ToolEQA also achieves state-of-the-art performance on the HM-EQA, OpenEQA, and EXPRESS-Bench datasets, demonstrating its generality. Project homepage: https://tooleqa.github.io.

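The abstract describes an agent that alternates reasoning steps with calls to external tools, using each tool result to decide where to explore next. A minimal sketch of such a loop is given below. It is not the ToolEQA implementation: every name in it (`run_agent`, `Trajectory`, `toy_planner`, the `go_to` tool) is a hypothetical stand-in, and the planner is a hard-coded function where a real system would query a VLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types for illustration; none of these names come from ToolEQA.
Tool = Callable[[str], str]
# A planner maps (question, history) to (thought, tool_name, tool_input, answer).
# It returns answer=None while it still wants to call a tool.
Planner = Callable[[str, List[str]], Tuple[str, str, str, Optional[str]]]


@dataclass
class Trajectory:
    """Reasoning trajectory: interleaved thoughts and tool observations."""
    steps: List[str] = field(default_factory=list)
    answer: Optional[str] = None


def run_agent(question: str, planner: Planner, tools: Dict[str, Tool],
              max_steps: int = 5) -> Trajectory:
    """Alternate reasoning and tool calls until the planner gives an answer."""
    traj = Trajectory()
    for _ in range(max_steps):
        thought, tool_name, tool_input, answer = planner(question, traj.steps)
        traj.steps.append(f"thought: {thought}")
        if answer is not None:
            traj.answer = answer
            break
        if tool_name not in tools:
            # Feed the error back so the planner can recover on the next step.
            traj.steps.append(f"observation: unknown tool {tool_name!r}")
            continue
        traj.steps.append(f"observation: {tools[tool_name](tool_input)}")
    return traj


def toy_planner(question: str, history: List[str]):
    """Stand-in for a VLM: explore once, then answer from the observation."""
    seen = [s for s in history if s.startswith("observation: ")]
    if not seen:
        return ("I have not seen the kitchen yet.", "go_to", "kitchen", None)
    return ("The observation answers the question.", "", "", "red")


# A single assumed navigation tool that returns a text description.
toy_tools: Dict[str, Tool] = {
    "go_to": lambda room: f"in {room}; a red kettle is on the counter",
}

result = run_agent("What color is the kettle?", toy_planner, toy_tools)
print(result.answer)
for step in result.steps:
    print(step)
```

The `max_steps` cap is a simple proxy for bounding exploration; a system of the kind the abstract describes would more likely track exploration distance in the environment.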