Vision-and-language navigation (VLN) agents are trained to navigate real-world environments by following natural language instructions. A major challenge in VLN is the scarcity of training data, which limits how well models generalize. Previous approaches address this by adding supervision during training, which often requires costly human-annotated data and restricts scalability. In this paper, we introduce a masked path modeling (MPM) objective that pretrains an agent on self-collected data for downstream navigation tasks. The agent first explores navigation environments without a specific goal and records the paths it traverses. We then train the agent on these paths to reconstruct the original path given a randomly masked subpath. In this way, the agent can accumulate a large and diverse dataset while learning conditional action generation. We evaluate MPM on several VLN datasets and show that it is effective across different levels of instruction complexity. MPM improves success rates on the val-unseen split by 1.32% on Room-to-Room, 1.05% on Room-for-Room, and 1.19% on Room-across-Room. Our analysis further indicates that additional gains are possible when the agent is allowed to explore unseen environments before testing.
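The data-collection and masking steps described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy navigation graph, the names `explore`, `mask_path`, and `build_examples`, the `<mask>` token, the no-revisit random walk, and the choice to mask a random subset of steps at a fixed ratio. The abstract does not specify the exploration policy or the masking scheme.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a masked step

# Hypothetical toy navigation graph: node -> neighbouring nodes.
TOY_GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}


def explore(graph, start, max_steps, rng):
    """Goal-free random walk that never revisits a node; returns the path."""
    path = [start]
    visited = {start}
    for _ in range(max_steps):
        candidates = [n for n in graph[path[-1]] if n not in visited]
        if not candidates:  # dead end: stop exploring
            break
        nxt = rng.choice(candidates)
        path.append(nxt)
        visited.add(nxt)
    return path


def mask_path(path, mask_ratio, rng):
    """Return (masked_input, target) with a random subset of steps masked.

    Assumption: steps are masked independently of their position, and at
    least one step stays visible whenever the path has more than one node.
    """
    n_mask = max(1, round(len(path) * mask_ratio))
    n_mask = min(n_mask, len(path) - 1)
    masked_idx = set(rng.sample(range(len(path)), n_mask))
    masked = [MASK if i in masked_idx else node for i, node in enumerate(path)]
    return masked, list(path)


def build_examples(graph, num_paths, max_steps, mask_ratio, seed=0):
    """Collect paths by exploration and turn each into a reconstruction pair."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    examples = []
    for _ in range(num_paths):
        path = explore(graph, rng.choice(nodes), max_steps, rng)
        examples.append(mask_path(path, mask_ratio, rng))
    return examples


if __name__ == "__main__":
    for masked, target in build_examples(TOY_GRAPH, 3, 4, 0.25):
        print(masked, "->", target)
```

Each pair gives a model the partially masked path as input and the full path as the reconstruction target. In an actual VLN setting the nodes would be viewpoints with visual features and the agent would predict actions, which this sketch does not attempt to model.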