Embodied agents equipped with GPT as their brain have exhibited extraordinary thinking and decision-making abilities across various tasks. However, existing zero-shot agents for vision-and-language navigation (VLN) only prompt GPT to handle excessive environmental information and select potential locations within localized environments, without constructing an effective "global view" (e.g., a commonly used map) that lets the agent understand the overall environment. In this work, we present a novel map-guided, GPT-based path-planning agent, dubbed MapGPT, for the zero-shot VLN task. Specifically, we convert a topological map constructed online into prompts to encourage map-guided global exploration, and we require the agent to explicitly output and update a multi-step path plan so that it avoids getting stuck in local exploration. Extensive experiments demonstrate that MapGPT is effective, achieving success rates of 38.8% on R2R and 28.4% on REVERIE and showcasing the newly emerged global thinking and path-planning capabilities of the GPT model. Unlike previous VLN agents, which require separate parameter fine-tuning or dataset-specific prompt design to accommodate the different instruction styles of different datasets, MapGPT is more unified: it adapts to different instruction styles seamlessly, which is the first of its kind in this field.
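To make the idea of turning an online topological map into a prompt more concrete, the following is a minimal Python sketch. It is not the MapGPT implementation: the `TopoMap` class, the `build_prompt` helper, the "Place N" naming, and the prompt wording are all hypothetical and chosen only for illustration. It assumes the map is an undirected graph that grows as the agent visits places and observes their navigable neighbors, and that the previous multi-step plan is fed back into the next prompt so it can be updated.

```python
from collections import defaultdict


class TopoMap:
    """Hypothetical online topological map (illustrative, not the paper's code).

    Nodes are integer place IDs; edges record observed connectivity.
    """

    def __init__(self):
        self.edges = defaultdict(set)  # place ID -> set of connected place IDs
        self.visited = []              # visited places, in order of first visit

    def observe(self, current, neighbors):
        """Record a visit to `current` and the navigable places seen from it."""
        if current not in self.visited:
            self.visited.append(current)
        for n in neighbors:
            self.edges[current].add(n)
            self.edges[n].add(current)

    def unvisited(self):
        """Places that have been observed but not yet visited."""
        return sorted(n for n in self.edges if n not in self.visited)

    def to_prompt(self):
        """Serialize the map as plain text (assumed format)."""
        lines = ["Map:"]
        for node in self.visited:
            adj = ", ".join(f"Place {n}" for n in sorted(self.edges[node]))
            lines.append(f"Place {node} is connected with {adj}")
        frontier = ", ".join(f"Place {n}" for n in self.unvisited())
        lines.append(f"Unvisited places: {frontier}")
        return "\n".join(lines)


def build_prompt(instruction, topo_map, previous_plan):
    """Combine instruction, map text, and the last plan into one prompt string."""
    return "\n".join([
        f"Instruction: {instruction}",
        topo_map.to_prompt(),
        f"Previous plan: {previous_plan}",
        "Update your multi-step plan, then choose the next place to go.",
    ])


if __name__ == "__main__":
    m = TopoMap()
    m.observe(0, [1, 2])  # at Place 0 the agent sees Places 1 and 2
    m.observe(1, [3])     # after moving to Place 1 it sees Place 3
    print(build_prompt("Walk to the kitchen.", m, "Explore Place 1, then Place 3."))
```

Listing the unvisited places explicitly is one way to let the model consider backtracking to an earlier branch rather than choosing only among the current neighbors, which is the kind of global exploration the abstract describes.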