Embodied agents that use GPT as their brain have exhibited extraordinary decision-making and generalization abilities across various tasks. However, existing zero-shot agents for vision-and-language navigation (VLN) only prompt GPT-4 to select potential locations within localized environments, without constructing an effective "global view" that lets the agent understand the overall environment. In this work, we present a novel map-guided GPT-based agent, dubbed MapGPT, which introduces an online, linguistically formed map to encourage global exploration. Specifically, we build an online map that includes node information and topological relationships and incorporate it into the prompts to help GPT understand the spatial environment. Benefiting from this design, we further propose an adaptive planning mechanism that assists the agent in performing multi-step path planning based on the map, systematically exploring multiple candidate nodes or sub-goals step by step. Extensive experiments demonstrate that MapGPT is applicable to both GPT-4 and GPT-4V and achieves state-of-the-art zero-shot performance on both R2R and REVERIE (~10% and ~12% improvements in success rate (SR), respectively), showcasing the newly emerged global thinking and path planning abilities of GPT.
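To make the idea of an online map expressed in language more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a map that stores only node identifiers and their connections, grows as the agent observes new navigable nodes, and is rendered as plain text for inclusion in a prompt. The class name `OnlineMap`, its methods, and the prompt wording are all hypothetical. The breadth-first search is likewise an assumption, added to show how an agent could travel to a non-adjacent node once a planner has selected it.

```python
from collections import deque


class OnlineMap:
    """Topological map grown step by step as the agent moves (hypothetical helper)."""

    def __init__(self):
        self.edges = {}    # node id -> set of neighbouring node ids
        self.visited = []  # visited node ids, in order of first visit

    def update(self, current, neighbours):
        """Record the current node and the navigable nodes observed from it."""
        if current not in self.visited:
            self.visited.append(current)
        self.edges.setdefault(current, set()).update(neighbours)
        for n in neighbours:
            self.edges.setdefault(n, set()).add(current)

    def to_prompt(self):
        """Render the map as plain text (illustrative wording, not the paper's prompt)."""
        lines = ["Map:"]
        for node in self.visited:
            linked = ", ".join(sorted(self.edges[node]))
            lines.append(f"Place {node} is connected with Places {linked}")
        unexplored = sorted(n for n in self.edges if n not in self.visited)
        lines.append("Unexplored places: " + (", ".join(unexplored) or "none"))
        return "\n".join(lines)

    def shortest_path(self, start, goal):
        """Breadth-first search over known edges; returns a node list or None."""
        queue, parent = deque([start]), {start: None}
        while queue:
            node = queue.popleft()
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            for nxt in sorted(self.edges.get(node, ())):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        return None


# Two navigation steps: start at node "0", then move to node "1".
m = OnlineMap()
m.update("0", ["1", "2"])
m.update("1", ["0", "3"])
prompt = m.to_prompt()            # text block to append to the LLM prompt
path = m.shortest_path("1", "2")  # route back to an unexplored candidate
```

In this sketch the map text is regenerated at every step, so the language model always sees both the explored structure and the remaining unexplored candidates, which is the kind of information multi-step planning over a map would need.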