NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models

Trained on data of unprecedented scale, large language models (LLMs) such as ChatGPT and GPT-4 exhibit significant reasoning abilities that emerge from model scaling. This trend underscores the potential of training LLMs on unlimited language data, advancing the development of a universal embodied agent. In this work, we introduce NavGPT, a purely LLM-based instruction-following navigation agent, to reveal the reasoning capability of GPT models in complex embodied scenes by performing zero-shot sequential action prediction for vision-and-language navigation (VLN). At each step, NavGPT takes textual descriptions of visual observations, navigation history, and future explorable directions as inputs, reasons about the agent's current status, and decides how to approach the target. Through comprehensive experiments, we demonstrate that NavGPT can explicitly perform high-level planning for navigation, including decomposing instructions into sub-goals, integrating commonsense knowledge relevant to navigation task resolution, identifying landmarks from observed scenes, tracking navigation progress, and adapting to exceptions with plan adjustment. Furthermore, we show that LLMs are capable of generating high-quality navigational instructions from the observations and actions along a path, as well as drawing an accurate top-down metric trajectory given the agent's navigation history. Although the zero-shot performance of NavGPT on R2R tasks still falls short of trained models, we suggest adapting multi-modality inputs for LLMs to use them as visual navigation agents and applying the explicit reasoning of LLMs to benefit learning-based models.
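The per-step loop described above (textual observation, navigation history, and explorable directions in; a reasoned action out) can be illustrated with a minimal sketch. This is not the authors' prompt format or implementation. Every name here (`StepInput`, `AgentState`, `build_prompt`, `parse_action`, `navigate`), the prompt wording, and the `Action:` output convention are hypothetical, and the LLM is assumed to be any callable that maps a prompt string to a response string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

STOP = "STOP"  # hypothetical token the model emits to end the episode


@dataclass
class StepInput:
    observation: str            # textual description of the current view
    candidates: Dict[str, str]  # viewpoint id -> description of that direction


@dataclass
class AgentState:
    instruction: str
    history: List[str] = field(default_factory=list)


def build_prompt(state: AgentState, step: StepInput) -> str:
    """Combine instruction, history, observation, and directions into one prompt."""
    history = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(state.history)) or "(none)"
    options = "\n".join(f"- {vid}: {desc}" for vid, desc in step.candidates.items())
    return (
        f"Instruction: {state.instruction}\n"
        f"History:\n{history}\n"
        f"Observation: {step.observation}\n"
        f"Navigable directions:\n{options}\n"
        "Think step by step, then finish with 'Action: <viewpoint id>' "
        f"or 'Action: {STOP}'."
    )


def parse_action(response: str, candidates: Dict[str, str]) -> str:
    """Return the last valid 'Action:' line; stop if nothing valid is found."""
    for line in reversed(response.strip().splitlines()):
        if line.startswith("Action:"):
            choice = line[len("Action:"):].strip()
            if choice == STOP or choice in candidates:
                return choice
    return STOP


def navigate(
    state: AgentState,
    get_step: Callable[[str], StepInput],  # assumed environment lookup by viewpoint id
    llm: Callable[[str], str],             # assumed text-in, text-out model interface
    start: str,
    max_steps: int = 10,
) -> List[str]:
    """Zero-shot sequential action prediction: one LLM call per step."""
    path = [start]
    for _ in range(max_steps):
        step = get_step(path[-1])
        action = parse_action(llm(build_prompt(state, step)), step.candidates)
        if action == STOP:
            break
        # Record the step in text form so later prompts can track progress.
        state.history.append(f"At {path[-1]}: {step.observation} -> moved to {action}")
        path.append(action)
    return path
```

Keeping the history as plain text mirrors the abstract's point that the agent works purely from language: the model sees its own past steps in the next prompt and can use them to track progress or revise its plan.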
