In Vision-and-Language Navigation (VLN), an agent must plan a path to a target specified by a language instruction, using its visual observations. Consequently, prevailing VLN methods focus primarily on building powerful planners through visual-textual alignment. However, these approaches often bypass comprehensive scene understanding prior to planning, leaving the agent with insufficient perception and prediction capabilities. We therefore propose P$^{3}$Nav, a novel end-to-end framework that integrates perception, prediction, and planning in a unified pipeline to strengthen the VLN agent's scene understanding and boost navigation success. Specifically, P$^{3}$Nav augments perception by extracting complementary cues from object-level and map-level perspectives. It then predicts waypoints to model the agent's potential future states, endowing the agent with intrinsic awareness of candidate positions during navigation. Conditioned on these future waypoints, P$^{3}$Nav further forecasts semantic map cues, enabling proactive planning and reducing strict reliance on purely historical context. Integrating these perceptual and predictive cues, a holistic planning module finally carries out the VLN task. Extensive experiments demonstrate that P$^{3}$Nav achieves new state-of-the-art performance on the REVERIE, R2R-CE, and RxR-CE benchmarks.