Real-world deployment of Vision-and-Language Navigation (VLN) agents is constrained by the scarcity of reliable supervision after offline training. While recent adaptation methods attempt to mitigate distribution shift via environment-driven self-supervision (e.g., entropy minimization), these signals are often noisy and can cause the agent to amplify its own mistakes during long-horizon sequential decision-making. In this paper, we propose a paradigm shift that positions user feedback, specifically episode-level success confirmations and goal-level corrections, as a primary and general-purpose supervision signal for VLN. Unlike internal confidence scores, user feedback is intent-aligned and consistent in situ, directly correcting the agent's drift away from user instructions. To leverage this supervision effectively, we introduce a user-feedback-driven learning framework built around a topology-aware trajectory construction pipeline. This mechanism lifts sparse, goal-level corrections into dense, path-level supervision by generating feasible paths on the agent's incrementally built topological graph, enabling sample-efficient imitation learning without step-by-step human demonstrations. Furthermore, we develop a persistent memory bank for warm-start initialization, supporting the reuse of previously acquired topology and cached representations across navigation sessions. Extensive experiments on the GSA-R2R benchmark demonstrate that our approach transforms sparse interaction into robust supervision, consistently outperforming environment-driven baselines while exhibiting strong adaptability across diverse instruction styles.
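The core idea of the topology-aware trajectory construction can be sketched as a graph search: given a user's goal-level correction, the agent finds a feasible path from its current node to the corrected goal on its incrementally built topological graph, and each edge of that path becomes a dense imitation target. The sketch below is illustrative only, assuming a simple adjacency-list graph and plain BFS; the function name `lift_goal_correction` and the graph representation are hypothetical, not the paper's actual implementation.

```python
from collections import deque

def lift_goal_correction(graph, current_node, corrected_goal):
    """Hypothetical sketch: lift a sparse goal-level correction into
    dense path-level supervision by finding a feasible path on the
    agent's incrementally built topological graph via BFS.

    graph: dict mapping each node id -> list of adjacent node ids.
    Returns the node sequence from current_node to corrected_goal,
    or None if the goal is unreachable on the current graph.
    """
    parent = {current_node: None}
    queue = deque([current_node])
    while queue:
        node = queue.popleft()
        if node == corrected_goal:
            # Reconstruct the path; each edge along it becomes one
            # step-level imitation-learning target.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # goal not yet reachable on the explored topology
```

Because the path is generated from the agent's own explored topology, the resulting supervision is guaranteed feasible for the agent, which is what allows sample-efficient imitation learning without step-by-step human demonstrations.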