View Invariant Learning for Vision-Language Navigation in Continuous Environments

Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most existing approaches are sensitive to viewpoint changes, i.e., variations in camera height and viewing angle. Here we introduce a more general scenario, V$^2$-VLNCE (VLNCE with Varied Viewpoints), and propose a view-invariant post-training framework, VIL (View Invariant Learning), that makes existing navigation policies more robust to changes in camera viewpoint. VIL employs a contrastive learning framework to learn sparse, view-invariant features. We also introduce a teacher-student framework for the Waypoint Predictor Module, a standard component of VLNCE baselines, in which a view-dependent teacher model distills knowledge into a view-invariant student model. These components are optimized jointly in an end-to-end training paradigm. Empirical results show that our method outperforms state-of-the-art approaches on V$^2$-VLNCE by 8-15\% in Success Rate on two standard benchmark datasets, R2R-CE and RxR-CE. Evaluated in the standard VLNCE setting, VIL often still improves performance despite being trained for varied viewpoints, and on the harder RxR-CE dataset it achieves state-of-the-art performance across all metrics. This suggests that adding VIL does not diminish standard-viewpoint performance and that it can serve as a plug-and-play post-training method. We further evaluate VIL with simulated camera placements derived from real robot configurations (e.g., Stretch RE-1, LoCoBot) and observe consistent performance improvements. Finally, we present a proof-of-concept real-robot evaluation in two physical environments using a panoramic RGB sensor combined with LiDAR. The code is available at https://github.com/realjoshqsun/V2-VLNCE.
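
To make the two training signals named above more concrete, the following is a minimal sketch, not the implementation in the linked repository. It assumes a standard InfoNCE objective for the contrastive term and a mean-squared-error term for the teacher-student distillation. The function names, the temperature, and the weight `distill_weight` are all hypothetical, and the sparsity constraint mentioned in the abstract is not modeled here.

```python
import numpy as np


def info_nce(feats_standard, feats_varied, temperature=0.1):
    """InfoNCE loss (assumed form): row i of feats_standard is the positive
    for row i of feats_varied; all other rows in the batch are negatives."""
    a = feats_standard / np.linalg.norm(feats_standard, axis=1, keepdims=True)
    b = feats_varied / np.linalg.norm(feats_varied, axis=1, keepdims=True)
    logits = a @ b.T / temperature                        # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def distill_loss(teacher_out, student_out):
    """MSE between teacher and student waypoint predictions (assumed form)."""
    return float(np.mean((teacher_out - student_out) ** 2))


def joint_loss(feats_standard, feats_varied, teacher_out, student_out,
               distill_weight=1.0):
    """Hypothetical joint objective: contrastive term plus weighted distillation."""
    return (info_nce(feats_standard, feats_varied)
            + distill_weight * distill_loss(teacher_out, student_out))


# Toy check with random stand-ins for features of the same scenes under two views.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
aligned = info_nce(feats, feats + 0.01 * rng.normal(size=(8, 16)))
mismatched = info_nce(feats, feats[::-1])  # positives paired with the wrong scenes
```

In this toy check, `aligned` comes out smaller than `mismatched`, which is the behavior a view-invariant encoder is trained toward: features of the same scene under different camera heights and angles should be more similar to each other than to features of other scenes.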
