Vision-and-Language Navigation (VLN) has attracted significant research interest in recent years because of its potential applications in real-world scenarios. However, existing VLN methods suffer from spurious associations, which lead to poor generalization and a significant performance gap between seen and unseen environments. In this paper, we tackle this challenge by proposing CausalVLN, a unified framework based on the causal learning paradigm that trains a robust navigator capable of learning unbiased feature representations. Specifically, we establish reasonable assumptions about the confounders for vision and language in VLN using the structural causal model (SCM). Building upon this, we propose an iterative backdoor-based representation learning (IBRL) method that enables adaptive and effective intervention on confounders. Furthermore, we introduce visual and linguistic backdoor causal encoders that enable unbiased feature expression across modalities during training and validation, enhancing the agent's ability to generalize across different environments. Experiments on three VLN datasets (R2R, RxR, and REVERIE) show that the proposed method outperforms previous state-of-the-art approaches. Moreover, detailed visualization analysis demonstrates that CausalVLN significantly narrows the performance gap between seen and unseen environments, underscoring its strong generalization capability.
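The backdoor-based intervention mentioned in the abstract builds on the standard backdoor adjustment, P(Y | do(X)) = Σ_z P(Y | X, z) P(z). The following is a minimal sketch of that textbook formula on a toy discrete problem, not the IBRL method or the causal encoders themselves. The variables, the single binary confounder Z, and all probability values are assumptions made up for illustration.

```python
# Toy illustration of the standard backdoor adjustment with one binary
# confounder Z. All names and probabilities are hypothetical, chosen for
# illustration only; none of them come from the paper.

P_Z = {0: 0.5, 1: 0.5}            # P(Z = z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}   # P(X = 1 | Z = z): Z influences X

# P(Y = 1 | X = x, Z = z), keyed by (x, z). Y depends only on Z here,
# so X has no causal effect on Y in this toy model.
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (1, 0): 0.1, (0, 1): 0.9, (1, 1): 0.9}


def p_x_given_z(x, z):
    """P(X = x | Z = z) for binary x."""
    p1 = P_X1_GIVEN_Z[z]
    return p1 if x == 1 else 1.0 - p1


def observational(x):
    """P(Y = 1 | X = x): weights each z by the posterior P(z | X = x)."""
    joint = {z: p_x_given_z(x, z) * P_Z[z] for z in P_Z}
    norm = sum(joint.values())
    return sum(P_Y1_GIVEN_XZ[(x, z)] * joint[z] / norm for z in P_Z)


def interventional(x):
    """P(Y = 1 | do(X = x)): weights each z by the prior P(z)."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * P_Z[z] for z in P_Z)


# Compare the apparent effect of X on Y with the adjusted effect.
obs_gap = observational(1) - observational(0)
do_gap = interventional(1) - interventional(0)
print("observational gap:", round(obs_gap, 2))
print("interventional gap:", round(do_gap, 2))
```

In this toy model the observational gap is large even though X has no causal effect on Y, because Z drives both. Averaging over the prior P(z) instead of the posterior P(z | x) removes that spurious association, which is the kind of bias the abstract attributes to confounders in VLN.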