Outdoor Vision-and-Language Navigation (VLN) requires an agent to navigate realistic 3D outdoor environments by following natural language instructions. Existing VLN methods are held back by the limited diversity of navigation environments and the scarcity of training data. To address these issues, we propose VLN-Video, which exploits the diverse outdoor environments captured in driving videos from multiple U.S. cities, augmented with automatically generated navigation instructions and actions, to improve outdoor VLN performance. VLN-Video combines intuitive classical approaches with modern deep learning techniques: it uses template infilling to generate grounded navigation instructions, together with an image rotation similarity-based navigation action predictor, to obtain VLN-style data from driving videos for pre-training deep learning VLN models. We pre-train the model on the Touchdown dataset and on our video-augmented dataset built from driving videos, using three proxy tasks (Masked Language Modeling, Instruction and Trajectory Matching, and Next Action Prediction) so that it learns temporally-aware and visually-aligned instruction representations. The learned instruction representation is then adapted to the state-of-the-art navigator during fine-tuning on the Touchdown dataset. Empirical results show that VLN-Video significantly outperforms the previous state-of-the-art models by 2.1% in task completion rate, achieving a new state of the art on the Touchdown dataset.
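As a rough illustration of the template infilling idea mentioned above, the following minimal sketch fills one sentence template per navigation step with an object label and joins the results into an instruction. The templates, the action vocabulary, the object labels, and the function name `fill_instruction` are all hypothetical and chosen only for illustration; the abstract does not specify the actual templates or how objects are detected.

```python
from typing import List, Sequence

# Hypothetical templates keyed by action; these are illustrative assumptions,
# not the templates used by VLN-Video.
TEMPLATES = {
    "forward": "Go straight past the {obj}.",
    "left": "Turn left at the {obj}.",
    "right": "Turn right at the {obj}.",
    "stop": "Stop near the {obj}.",
}


def fill_instruction(actions: Sequence[str], objects: Sequence[str]) -> str:
    """Fill one template per (action, object) pair and join the sentences."""
    if len(actions) != len(objects):
        raise ValueError("need exactly one object label per action")
    sentences: List[str] = []
    for action, obj in zip(actions, objects):
        template = TEMPLATES[action]  # raises KeyError for an unknown action
        sentences.append(template.format(obj=obj))
    return " ".join(sentences)


if __name__ == "__main__":
    # Object labels would come from some detector run on video frames (assumed).
    print(fill_instruction(
        ["forward", "left", "stop"],
        ["traffic light", "red awning", "bus stop"],
    ))
```

In a pipeline of this kind, the action sequence would be supplied by a separate action predictor and the object labels by a visual recognizer, so the generated text stays grounded in what the video actually shows.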