$A^2$Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models

We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to follow a path described by language instructions without requiring any path-instruction annotation data. Such instructions typically have complex grammatical structures and contain a variety of action descriptions (e.g., "proceed beyond", "depart from"). Correctly understanding and executing these action demands is a critical problem, and the absence of annotated data makes it even more challenging. A well-educated human, by contrast, can easily understand path instructions without any special training. In this paper, we propose an action-aware zero-shot VLN method ($A^2$Nav) that exploits the vision-and-language ability of foundation models. Specifically, the proposed method consists of an instruction parser and an action-aware navigation policy. The instruction parser uses the advanced reasoning ability of large language models (e.g., GPT-3) to decompose complex navigation instructions into a sequence of action-specific object navigation sub-tasks. Each sub-task requires the agent to localize the object and navigate to a specific goal position according to the associated action demand. To accomplish these sub-tasks, an action-aware navigation policy is learned from freely collected action-specific datasets that reveal the distinct characteristics of each action demand. We then use the learned navigation policy to execute the sub-tasks sequentially and thereby follow the navigation instruction. Extensive experiments show that $A^2$Nav achieves promising ZS-VLN performance and even surpasses supervised learning methods on the R2R-Habitat and RxR-Habitat datasets.
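As a rough illustration of the two-stage pipeline described above, the following Python sketch decomposes an instruction into (action, landmark) sub-tasks and executes them in order. It is not the authors' implementation. The action labels (`GO_PAST`, `LEAVE`, `GO_TO`), the phrase table, the `SubTask` type, and both functions are hypothetical names introduced only for this example. The rule-based `parse_instruction` is a toy stand-in for the large-language-model parser, and the policies are placeholder callables rather than learned navigation policies.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SubTask:
    action: str    # hypothetical action label, e.g. "GO_PAST"
    landmark: str  # object the sub-task refers to, e.g. "sofa"


# Hypothetical phrase-to-action table; the paper's actual action set may differ.
ACTION_PHRASES = {
    "proceed beyond": "GO_PAST",
    "depart from": "LEAVE",
    "go to": "GO_TO",
}


def parse_instruction(instruction: str) -> List[SubTask]:
    """Toy rule-based stand-in for the LLM-based instruction parser."""
    subtasks = []
    # Split the instruction into clauses on commas, periods, and "then".
    for clause in re.split(r",|\band then\b|\bthen\b|\.", instruction.lower()):
        clause = clause.strip()
        for phrase, action in ACTION_PHRASES.items():
            if clause.startswith(phrase):
                landmark = clause[len(phrase):].strip()
                # Drop a leading article so only the object name remains.
                landmark = re.sub(r"^(the|a|an)\s+", "", landmark)
                subtasks.append(SubTask(action, landmark))
                break
    return subtasks


# A policy takes one sub-task and reports whether it reached the goal.
Policy = Callable[[SubTask], bool]


def follow_instruction(instruction: str, policies: Dict[str, Policy]) -> List[SubTask]:
    """Run sub-tasks sequentially; stop at the first one that fails."""
    completed = []
    for task in parse_instruction(instruction):
        if not policies[task.action](task):
            break
        completed.append(task)
    return completed


# Placeholder policies that always succeed (assumption for the demo only).
demo_policies = {action: (lambda task: True) for action in ACTION_PHRASES.values()}
demo_instruction = "Depart from the bedroom, proceed beyond the sofa, then go to the kitchen table."
demo_result = follow_instruction(demo_instruction, demo_policies)
```

In this sketch each action label maps to its own policy callable, which mirrors the idea that different action demands ("go past" versus "go to") imply different goal positions relative to the same landmark. A real system would replace the phrase table with an LLM prompt and the callables with learned policies acting in a simulator.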
