Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most navigation policies are sensitive to viewpoint changes, i.e., variations in camera height and viewing angle that alter the agent's observation. In this paper, we introduce a generalized scenario, V2-VLNCE (VLNCE with Varied Viewpoints), and propose VIL (View Invariant Learning), a view-invariant post-training strategy that enhances the robustness of existing navigation policies to changes in camera viewpoint. VIL employs a contrastive learning framework to learn sparse, view-invariant features. Additionally, we introduce a teacher-student framework for the Waypoint Predictor Module, a core component of most VLNCE baselines, in which a view-dependent teacher model distills knowledge into a view-invariant student model. We adopt an end-to-end training paradigm to jointly optimize these components, eliminating the cost of training each module separately. Empirical results show that our method outperforms state-of-the-art approaches on V2-VLNCE by 8-15% in Success Rate on the two standard benchmarks, R2R-CE and RxR-CE. Furthermore, we evaluate VIL under the standard VLNCE setting and find that, despite being trained for varied viewpoints, it often still improves performance. On the more challenging RxR-CE dataset, our method also achieves state-of-the-art performance across all metrics when compared to other map-free methods. This suggests that adding VIL does not diminish standard-viewpoint performance and that VIL can serve as a plug-and-play post-training method.
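The two objectives named above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`info_nce`, `vil_loss`), the InfoNCE form of the contrastive term, and the use of a simple mean-squared error for the waypoint distillation term are all illustrative assumptions. The sketch pairs features of the same scene observed from the default viewpoint and a varied viewpoint as positives, and penalizes the student waypoint predictor for deviating from the teacher.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Contrastive (InfoNCE-style) loss: row i of z_a and row i of z_b
    are features of the same scene under two viewpoints (positives);
    all other rows in the batch act as negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal (matched viewpoint pairs).
    return -np.mean(np.diag(log_prob))

def vil_loss(z_default, z_varied, wp_teacher, wp_student,
             alpha=1.0, beta=1.0):
    """Hypothetical combined objective: contrastive view-invariance term
    plus a teacher-student distillation term on waypoint predictions
    (simplified here to mean-squared error)."""
    contrastive = info_nce(z_default, z_varied)
    distill = np.mean((wp_teacher - wp_student) ** 2)
    return alpha * contrastive + beta * distill
```

As a sanity check, features paired with themselves (perfectly view-invariant) should score a lower contrastive loss than features paired with a mismatched permutation of the batch.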