NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models

Trained on an unprecedented scale of data, large language models (LLMs) like ChatGPT and GPT-4 exhibit significant reasoning abilities that emerge from model scaling. This trend underscores the potential of training LLMs with unlimited language data, advancing the development of a universal embodied agent. In this work, we introduce NavGPT, a purely LLM-based instruction-following navigation agent, to reveal the reasoning capability of GPT models in complex embodied scenes by performing zero-shot sequential action prediction for vision-and-language navigation (VLN). At each step, NavGPT takes textual descriptions of visual observations, the navigation history, and future explorable directions as inputs, reasons about the agent's current status, and decides how to approach the target. Through comprehensive experiments, we demonstrate that NavGPT can explicitly perform high-level planning for navigation, including decomposing instructions into sub-goals, integrating commonsense knowledge relevant to navigation task resolution, identifying landmarks from observed scenes, tracking navigation progress, and adapting to exceptions with plan adjustment. Furthermore, we show that LLMs are capable of generating high-quality navigational instructions from the observations and actions along a path, as well as drawing an accurate top-down metric trajectory given the agent's navigation history. Although NavGPT's zero-shot performance on R2R tasks still falls short of trained models, we suggest adapting multi-modality inputs for LLMs to use as visual navigation agents and applying the explicit reasoning of LLMs to benefit learning-based models.
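The per-step loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `StepInput`, `build_prompt`, `parse_action`, and `navigate`, the prompt wording, and the `Action: <id>` reply format are all assumptions made for illustration, and `llm` stands for any callable that maps a prompt string to a reply string.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StepInput:
    """Hypothetical container for what the agent is given at one step."""
    observation: str            # textual description of the current view
    candidates: Dict[str, str]  # viewpoint id -> description of that direction


def build_prompt(instruction: str, history: List[str], step: StepInput) -> str:
    """Assemble observation, history, and explorable directions into one prompt."""
    history_text = "\n".join(history) if history else "(none)"
    options = "\n".join(f"- {vid}: {desc}" for vid, desc in step.candidates.items())
    return (
        f"Instruction: {instruction}\n"
        f"History:\n{history_text}\n"
        f"Observation: {step.observation}\n"
        f"Explorable directions:\n{options}\n"
        "Reason about the current status, then answer with "
        "'Action: <id>' or 'Action: stop'."
    )


def parse_action(reply: str, candidates: Dict[str, str]) -> str:
    """Read the last 'Action:' line; fall back to 'stop' if it is missing or invalid."""
    for line in reversed(reply.strip().splitlines()):
        if line.lower().startswith("action:"):
            choice = line.split(":", 1)[1].strip()
            if choice.lower() == "stop":
                return "stop"
            if choice in candidates:
                return choice
    return "stop"


def navigate(
    instruction: str,
    get_step: Callable[[str], StepInput],
    llm: Callable[[str], str],
    start: str,
    max_steps: int = 10,
) -> List[str]:
    """Zero-shot sequential action prediction: one LLM call per navigation step."""
    history: List[str] = []
    path = [start]
    current = start
    for _ in range(max_steps):
        step = get_step(current)
        reply = llm(build_prompt(instruction, history, step))
        action = parse_action(reply, step.candidates)
        if action == "stop":
            break
        # Record the step as text so later prompts can track progress.
        history.append(
            f"Step {len(path)}: saw '{step.observation}', moved to {action}"
        )
        current = action
        path.append(current)
    return path
```

Keeping the history as plain text means every input to the model stays in language form, which matches the abstract's description of a purely LLM-based agent. A real system would also need a component that converts images into the textual observations assumed here.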
