The study of vision-and-language navigation (VLN) has typically relied on expert trajectories, which are often unavailable in real-world settings because collecting them requires significant effort. Existing approaches to training VLN agents beyond the available expert data rely on data augmentation or online exploration, which can be tedious and risky. In contrast, large repositories of suboptimal offline trajectories are easy to access. Inspired by research in offline reinforcement learning (ORL), we introduce a new problem setup, VLN-ORL, which studies VLN using suboptimal demonstration data. We propose a simple and effective reward-conditioned approach that accounts for dataset suboptimality when training VLN agents, along with benchmarks to evaluate progress and promote research in this area. We empirically study various noise models for characterizing dataset suboptimality, among other challenges unique to VLN-ORL, and instantiate the approach for the VLN$\circlearrowright$BERT and MTVM architectures in the R2R and RxR environments. Our experiments demonstrate that the proposed reward-conditioned approach leads to significant performance improvements, even in complex and intricate environments.