Perceiving the semantics and spatial structure of the environment is essential for the visual navigation of a household robot. However, most existing works employ visual backbones that are either pre-trained on independent images for classification or adapted to the indoor navigation domain with self-supervised learning methods, neglecting the spatial relationships that are essential to learning navigation. Inspired by the way humans naturally build semantically and spatially meaningful cognitive maps in their brains during navigation, we propose a novel navigation-specific visual representation learning method that contrasts the agent's egocentric views with semantic maps (Ego$^2$-Map). We use a vision transformer as the backbone encoder and train the model with data collected from the large-scale Habitat-Matterport3D environments. Ego$^2$-Map learning transfers the compact and rich information in a map, such as objects, structure, and transitions, to the agent's egocentric representations for navigation. Experiments show that agents using our learned representations on object-goal navigation outperform recent visual pre-training methods. Moreover, our representations significantly improve vision-and-language navigation in continuous environments for both high-level and low-level action spaces, achieving new state-of-the-art results of 47% success rate (SR) and 41% success weighted by path length (SPL) on the test server.
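To make the idea of contrasting egocentric views with semantic maps concrete, the following is a minimal sketch of a symmetric InfoNCE-style objective over paired view and map embeddings. The abstract does not specify the loss, so the objective, the temperature value, and every function name here are illustrative assumptions rather than the paper's implementation.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the row maximum for numerical stability
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def view_map_contrastive_loss(view_emb, map_emb, temperature=0.1):
    """Hypothetical symmetric contrastive loss (an assumption, not the paper's).

    view_emb[i] and map_emb[i] are assumed to come from the same agent pose,
    so they form a positive pair; all other pairings in the batch act as
    negatives.
    """
    views = [normalize(v) for v in view_emb]
    maps = [normalize(m) for m in map_emb]
    # Cosine similarity between every view and every map, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(v, m)) / temperature for m in maps]
        for v in views
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the view-to-map and map-to-view directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


if __name__ == "__main__":
    views = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    shuffled_maps = [views[1], views[2], views[0]]
    print("matched pairs:   ", view_map_contrastive_loss(views, views))
    print("mismatched pairs:", view_map_contrastive_loss(views, shuffled_maps))
```

In this toy setup the loss is small when each view embedding lines up with its own map embedding and large when the pairing is shuffled, which is the signal such an objective would use to pull map information into the egocentric representation.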