NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models

Trained with an unprecedented scale of data, large language models (LLMs) like ChatGPT and GPT-4 exhibit the emergence of significant reasoning abilities from model scaling. This trend underscores the potential of training LLMs with unlimited language data, advancing the development of a universal embodied agent. In this work, we introduce NavGPT, a purely LLM-based instruction-following navigation agent, to reveal the reasoning capability of GPT models in complex embodied scenes by performing zero-shot sequential action prediction for vision-and-language navigation (VLN). At each step, NavGPT takes textual descriptions of visual observations, navigation history, and future explorable directions as inputs, reasons about the agent's current status, and decides how to approach the target. Through comprehensive experiments, we demonstrate that NavGPT can explicitly perform high-level planning for navigation, including decomposing instructions into sub-goals, integrating commonsense knowledge relevant to navigation task resolution, identifying landmarks from observed scenes, tracking navigation progress, and adapting to exceptions with plan adjustment. Furthermore, we show that LLMs are capable of generating high-quality navigational instructions from observations and actions along a path, as well as drawing an accurate top-down metric trajectory given the agent's navigation history. Although the zero-shot performance of NavGPT on R2R tasks still falls short of trained models, we suggest adapting multi-modality inputs for LLMs to use as visual navigation agents and applying the explicit reasoning of LLMs to benefit learning-based models.
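
The per-step decision described above can be illustrated with a minimal Python sketch. It is not NavGPT's actual prompt format or code: the `StepInput` fields, the prompt wording, the `Action: <id>` reply convention, and the `llm` callable are all hypothetical names chosen for illustration, and the language model is represented only as a function from a prompt string to a reply string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StepInput:
    """Text-only inputs for one decision step (all field names are hypothetical)."""
    instruction: str
    observation: str  # textual description of the current visual observation
    history: List[str] = field(default_factory=list)  # summaries of earlier steps
    candidates: Dict[str, str] = field(default_factory=dict)  # id -> direction description


def build_prompt(step: StepInput) -> str:
    """Assemble instruction, history, observation, and explorable directions into one prompt."""
    history = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(step.history)) or "(none)"
    options = "\n".join(f"- {cid}: {desc}" for cid, desc in step.candidates.items())
    return (
        f"Instruction: {step.instruction}\n"
        f"History:\n{history}\n"
        f"Observation: {step.observation}\n"
        f"Explorable directions:\n{options}\n"
        "Reason about your current status, then end with "
        "'Action: <id>' or 'Action: STOP'."
    )


def parse_action(reply: str, candidates: Dict[str, str]) -> str:
    """Return the last valid 'Action: ...' choice in the reply, or 'STOP' if none is valid."""
    for line in reversed(reply.strip().splitlines()):
        if line.startswith("Action:"):
            choice = line.split(":", 1)[1].strip()
            if choice == "STOP" or choice in candidates:
                return choice
    return "STOP"


def decide(step: StepInput, llm: Callable[[str], str]) -> str:
    """One zero-shot decision: build the prompt, query the model, parse its chosen action."""
    return parse_action(llm(build_prompt(step)), step.candidates)
```

In a full episode, the chosen candidate and a summary of the current observation would be appended to `history` before the next call, so the model can track its progress against the instruction. Falling back to `STOP` on an unparseable reply is one conservative design choice among several; re-prompting the model would be another.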
