Large-scale pre-training has shown promising results on the vision-and-language navigation (VLN) task. However, most existing pre-training methods learn visual-textual associations from discrete panoramas. This requires the model to implicitly correlate incomplete and duplicate observations within the panoramas, which may impair an agent's spatial understanding. We therefore propose a new map-based, spatially aware pre-training paradigm for VLN. Concretely, we build a local metric map that explicitly aggregates incomplete observations and removes duplicates, while modeling navigation dependency in a global topological map. This hybrid design balances the demands of VLN for both short-term reasoning and long-term planning. Based on the hybrid map, we then devise a pre-training framework that learns a multimodal map representation, which enhances spatially aware cross-modal reasoning and thereby facilitates language-guided navigation. Extensive experiments demonstrate the effectiveness of the map-based pre-training route for VLN, and the proposed method achieves state-of-the-art results on four VLN benchmarks.
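The hybrid map described above can be illustrated with a minimal sketch. This is not the paper's implementation: the grid cell size, the use of mean pooling to merge duplicate observations, and all names (`build_local_metric_map`, `TopologicalMap`) are assumptions chosen only to show the idea of a local metric grid paired with a global topological graph.

```python
from collections import defaultdict


def build_local_metric_map(observations, cell_size=0.5):
    """Aggregate point observations into a 2D grid (hypothetical helper).

    observations: iterable of (x, y, feature), where feature is a list of floats.
    Observations that land in the same cell are mean-pooled, so duplicate
    views of one location collapse into a single cell feature.
    Returns a dict mapping (ix, iy) grid indices to the pooled feature.
    """
    sums = {}
    counts = defaultdict(int)
    for x, y, feat in observations:
        cell = (int(x // cell_size), int(y // cell_size))
        if cell not in sums:
            sums[cell] = [0.0] * len(feat)
        sums[cell] = [s + f for s, f in zip(sums[cell], feat)]
        counts[cell] += 1
    return {c: [s / counts[c] for s in total] for c, total in sums.items()}


class TopologicalMap:
    """Undirected graph over visited or observed viewpoints (assumed structure)."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_edge(self, a, b):
        # Record that the agent can navigate between viewpoints a and b.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def neighbors(self, node):
        return sorted(self.edges[node])


# Two overlapping observations near the origin plus one farther away.
obs = [(0.1, 0.1, [1.0, 0.0]), (0.2, 0.3, [3.0, 2.0]), (1.2, 0.1, [5.0, 5.0])]
grid = build_local_metric_map(obs, cell_size=0.5)

topo = TopologicalMap()
topo.add_edge("v0", "v1")
topo.add_edge("v1", "v2")
```

In this sketch the metric grid supports short-term, fine-grained reasoning around the agent, while the graph keeps the long-range connectivity needed for planning. A learned model would replace the mean pooling with its own aggregation.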