LLM-based agents have demonstrated impressive zero-shot performance on the vision-language navigation (VLN) task. However, existing LLM-based methods often focus only on high-level task planning, selecting nodes in predefined navigation graphs for movement while overlooking low-level control in navigation scenarios. To bridge this gap, we propose AO-Planner, a novel Affordances-Oriented Planner for the continuous VLN task. Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making, both performed in a zero-shot setting. Specifically, we employ a Visual Affordances Prompting (VAP) approach in which SAM segments the visible ground to provide navigational affordances; based on these, the LLM selects potential candidate waypoints and plans low-level paths towards the selected waypoints. We further propose a high-level PathAgent, which marks the planned paths on the image input and reasons about the most probable path by comprehending all environmental information. Finally, we convert the selected path into 3D coordinates using camera intrinsic parameters and depth information, avoiding challenging 3D predictions by LLMs. Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance (an 8.8% improvement in SPL). Our method can also serve as a data annotator that produces pseudo-labels, distilling its waypoint prediction ability into a learning-based predictor. This new predictor requires no waypoint data from the simulator and achieves 47% SR, competitive with supervised methods. We establish an effective connection between LLMs and the 3D world, presenting novel prospects for employing foundation models in low-level motion control.
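The final conversion described above — lifting a selected image-space path into 3D coordinates from camera intrinsics and depth — follows the standard pinhole back-projection formula. A minimal sketch of that step (the function name and the sample intrinsic values are illustrative assumptions, not taken from the paper):

```python
def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with metric depth into 3D camera coordinates.

    (fx, fy) are focal lengths in pixels; (cx, cy) is the principal point.
    Returns (X, Y, Z) in the camera frame, with Z pointing along the optical axis.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Illustrative intrinsics for a 640x480 camera (assumed values).
fx = fy = 500.0
cx, cy = 320.0, 240.0

# A pixel at the principal point maps straight ahead of the camera.
print(pixel_to_camera(320, 240, 2.0, fx, fy, cx, cy))  # → (0.0, 0.0, 2.0)
```

Because each waypoint on the planned path is mapped this way, the LLM only ever reasons in image space; the geometry is recovered deterministically afterwards.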