Vision-and-Language Navigation (VLN) is a task that requires an agent to navigate through an environment based on natural language instructions. At each step, the agent takes the next action by selecting from a set of navigable locations. In this paper, we go one step further and explore whether the agent can benefit from generating the potential future view during navigation. Intuitively, humans form an expectation of what the future environment will look like, based on the natural language instructions and surrounding views, and this expectation aids correct navigation. Hence, to equip the agent with the ability to generate the semantics of future navigation views, we first propose three proxy tasks for the agent's in-domain pre-training: Masked Panorama Modeling (MPM), Masked Trajectory Modeling (MTM), and Action Prediction with Image Generation (APIG). These three objectives teach the model to predict missing views in a panorama (MPM), predict missing steps in the full trajectory (MTM), and generate the next view based on the full instruction and navigation history (APIG), respectively. We then fine-tune the agent on the VLN task with an auxiliary loss that minimizes the difference between the view semantics generated by the agent and the ground-truth view semantics of the next step. Empirically, our VLN-SIG achieves new state-of-the-art performance on both the Room-to-Room dataset and the CVDN dataset. We further show qualitatively that our agent learns to fill in missing patches in future views, which makes the agent's predicted actions more interpretable. Lastly, we demonstrate that learning to predict future view semantics also improves the agent's performance on longer paths.
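To make the fine-tuning objective described above more concrete, the following is a minimal sketch of how a navigation loss could be combined with an auxiliary view-semantics loss. It is an illustration under stated assumptions, not the implementation used for VLN-SIG: the abstract does not specify how view semantics are represented or how the difference is measured. Here we assume each patch of the next view is summarized by one discrete semantic token, that the difference is measured with cross-entropy, and that the two terms are mixed with a hypothetical weight `aux_weight`. All function and parameter names are invented for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability of the target class under softmax(logits)."""
    probs = softmax(logits)
    return -math.log(probs[target_index])


def navigation_loss_with_semantics(action_logits, gold_action,
                                   patch_token_logits, gold_patch_tokens,
                                   aux_weight=0.5):
    """Combine an action loss with an auxiliary next-view semantics loss.

    action_logits:      scores over the navigable locations at this step
    gold_action:        index of the ground-truth next location
    patch_token_logits: one list of scores per patch of the next view
                        (ASSUMPTION: semantics are discrete tokens per patch)
    gold_patch_tokens:  ground-truth token index for each patch
    aux_weight:         hypothetical mixing weight, not taken from the paper
    """
    action_loss = cross_entropy(action_logits, gold_action)
    patch_losses = [
        cross_entropy(logits, token)
        for logits, token in zip(patch_token_logits, gold_patch_tokens)
    ]
    # Average over patches so the auxiliary term does not scale with view size.
    semantic_loss = sum(patch_losses) / len(patch_losses)
    total = action_loss + aux_weight * semantic_loss
    return total, action_loss, semantic_loss


# Toy step: 3 navigable locations, a next view with 2 patches, 4 possible tokens.
total, act, sem = navigation_loss_with_semantics(
    action_logits=[2.0, 0.1, -1.0],
    gold_action=0,
    patch_token_logits=[[1.5, 0.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]],
    gold_patch_tokens=[0, 2],
)
print(f"total={total:.4f} action={act:.4f} semantic={sem:.4f}")
```

In this sketch the auxiliary term only shapes the training signal; action selection at inference time would still pick from the navigable locations using the action scores alone.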