LLM-based agents have demonstrated impressive zero-shot performance in the vision-language navigation (VLN) task. However, these zero-shot methods focus only on high-level task planning, selecting nodes in predefined navigation graphs for movement, and overlook low-level control in realistic navigation scenarios. To bridge this gap, we propose AO-Planner, a novel affordances-oriented planning framework for the continuous VLN task. AO-Planner integrates various foundation models to achieve affordances-oriented motion planning and action decision-making, both performed in a zero-shot manner. Specifically, we employ a visual affordances prompting (VAP) approach, in which the visible ground is segmented using SAM to provide navigational affordances; based on these, the LLM selects potential next waypoints and generates low-level paths towards the selected waypoints. We further introduce a high-level agent, PathAgent, to identify the most probable pixel-based path and convert it into 3D coordinates to fulfill low-level motion. Experimental results on the challenging R2R-CE benchmark demonstrate that AO-Planner achieves state-of-the-art zero-shot performance (5.5% improvement in SPL). Our method establishes an effective connection between the LLM and the 3D world, circumventing the difficulty of directly predicting world coordinates, and presents novel prospects for employing foundation models in low-level motion control.
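The abstract does not spell out how a pixel-based path is converted into 3D coordinates. As a minimal sketch of one common way to do this, the example below assumes a pinhole camera with known intrinsics and an aligned depth map, and back-projects each pixel waypoint into the camera frame. The names `Intrinsics` and `pixel_path_to_camera_points`, the intrinsic values, and the toy depth map are all hypothetical and chosen for illustration; they are not taken from AO-Planner.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera parameters in pixels (assumed known, hypothetical values below)."""
    fx: float  # focal length along the image x-axis
    fy: float  # focal length along the image y-axis
    cx: float  # principal point, x
    cy: float  # principal point, y


def pixel_path_to_camera_points(
    path_uv: Sequence[Tuple[int, int]],
    depth: Sequence[Sequence[float]],
    k: Intrinsics,
) -> List[Tuple[float, float, float]]:
    """Back-project pixel waypoints (u, v) into camera-frame points (x, y, z).

    Assumes depth[v][u] holds the distance along the optical axis, in meters,
    for the pixel in column u and row v.
    """
    points = []
    for u, v in path_uv:
        z = float(depth[v][u])
        # Standard pinhole back-projection: offset from the principal point,
        # scaled by depth over focal length.
        x = (u - k.cx) * z / k.fx
        y = (v - k.cy) * z / k.fy
        points.append((x, y, z))
    return points


# Toy input: a 480x640 depth map that is 2 m everywhere, and a three-pixel path.
toy_depth = [[2.0] * 640 for _ in range(480)]
toy_intrinsics = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
toy_path = [(320, 240), (420, 240), (320, 340)]

camera_points = pixel_path_to_camera_points(toy_path, toy_depth, toy_intrinsics)
for point in camera_points:
    print(point)
```

Under these assumptions the LLM only has to reason in image space, and the geometry step supplies the metric coordinates. Mapping the resulting camera-frame points into world coordinates would additionally need the agent's pose, which this sketch leaves out.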