Vision-and-language navigation (VLN) asks an agent to follow a language instruction to navigate through a real 3D environment. Despite significant advances, conventional VLN agents are typically trained in disturbance-free environments and may easily fail in real-world scenarios, because they have not learned to handle disturbances such as sudden obstacles or human interruptions, which are common and often cause unexpected route deviations. In this paper, we present a model-agnostic training paradigm, called Progressive Perturbation-aware Contrastive Learning (PROPER), that enhances the generalization ability of existing VLN agents by requiring them to learn deviation-robust navigation. Specifically, we introduce a simple yet effective path perturbation scheme to implement the route deviation, under which the agent must still navigate successfully following the original instruction. Since directly forcing the agent to learn perturbed trajectories may lead to inefficient training, we design a progressively perturbed trajectory augmentation strategy, in which the agent self-adaptively learns to navigate under perturbation as its navigation performance on each specific trajectory improves. To encourage the agent to capture the difference introduced by perturbation, we further develop a perturbation-aware contrastive learning mechanism that contrasts perturbation-free trajectory encodings with their perturbation-based counterparts. Extensive experiments on R2R show that PROPER benefits multiple VLN baselines in perturbation-free scenarios. We further collect perturbed path data to construct an introspection subset of R2R, called Path-Perturbed R2R (PP-R2R). Results on PP-R2R show the unsatisfactory robustness of popular VLN agents and the ability of PROPER to improve navigation robustness.
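To make the idea of a route deviation concrete, the following is a minimal sketch of one possible path perturbation on a navigation graph. It is an illustration under stated assumptions, not the scheme used by PROPER: the abstract does not specify how perturbed paths are built, so the one-node detour, the adjacency-dict graph, and the names `perturb_path` and `is_valid_walk` are all hypothetical.

```python
import random


def perturb_path(graph, path, step, rng=None):
    """Return a copy of `path` with a one-node detour after `path[step]`.

    Assumption (not from the paper): a deviation is modeled as stepping to a
    neighbor that is not on the original path and then stepping back.
    If no such neighbor exists, the path is returned unchanged.
    """
    rng = rng or random.Random()
    node = path[step]
    on_path = set(path)
    # Sort candidates so that results are reproducible for a fixed seed.
    candidates = sorted(n for n in graph[node] if n not in on_path)
    if not candidates:
        return list(path)
    detour = rng.choice(candidates)
    return path[:step + 1] + [detour, node] + path[step + 1:]


def is_valid_walk(graph, path):
    """Check that every consecutive pair of nodes is connected in `graph`."""
    return all(b in graph[a] for a, b in zip(path, path[1:]))


# Toy graph (hypothetical): node X is the only off-path neighbor of B.
graph = {
    "A": ["B"],
    "B": ["A", "C", "X"],
    "C": ["B", "D"],
    "D": ["C"],
    "X": ["B"],
}
original = ["A", "B", "C", "D"]
perturbed = perturb_path(graph, original, step=1, rng=random.Random(0))
print(perturbed)  # ['A', 'B', 'X', 'B', 'C', 'D']
```

In this sketch the perturbed path keeps the original start and goal, so the original instruction still describes where the agent should end up, which mirrors the requirement that the agent navigate successfully under deviation.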