Image goal navigation requires two different skills: first, core navigation skills, including detecting free space and obstacles and making decisions based on an internal representation; and second, computing directional information by comparing visual observations to the goal image. Current state-of-the-art methods rely either on dedicated image matching or on pre-training computer vision modules on relative pose estimation. In this paper, we study whether this task can be solved efficiently by training full agents end-to-end with reinforcement learning (RL), as recent work has claimed. A positive answer would have impact beyond Embodied AI, as it would allow relative pose estimation to be trained from the navigation reward alone. In a large experimental study, we investigate the effect of architectural choices such as late fusion, channel stacking, space-to-depth projections and cross-attention, and their role in the emergence of relative pose estimators from navigation training. We show that the success of recent methods is influenced to some extent by simulator settings, which lead to shortcuts in simulation. However, we also show that these capabilities can be transferred, to some extent, to more realistic settings. We also find evidence of correlations between navigation performance and probed (emerging) relative pose estimation performance, an important sub-skill.
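To make two of the architectural choices named above concrete, the following is a minimal, dependency-free sketch of how channel stacking and a space-to-depth rearrangement are commonly defined. It is an illustration under assumptions, not the implementation studied in the paper: the function names (`channel_stack`, `space_to_depth`, `make_image`, `shape`), the height x width x channel nested-list layout, the channel ordering inside each block, and the toy image sizes are all hypothetical.

```python
def channel_stack(obs, goal):
    """Concatenate two H x W x C images along the channel axis.

    Assumption: the observation and goal are fed to one shared encoder
    as a single image with twice as many channels.
    """
    if len(obs) != len(goal) or len(obs[0]) != len(goal[0]):
        raise ValueError("observation and goal must share height and width")
    return [[pix_o + pix_g for pix_o, pix_g in zip(row_o, row_g)]
            for row_o, row_g in zip(obs, goal)]


def space_to_depth(img, block):
    """Move each block x block spatial patch into the channel axis.

    An H x W x C image becomes (H/block) x (W/block) x (C*block*block).
    Pixels inside a patch are visited row by row (an assumed ordering).
    """
    height, width = len(img), len(img[0])
    if height % block or width % block:
        raise ValueError("height and width must be divisible by block")
    out = []
    for i in range(0, height, block):
        out_row = []
        for j in range(0, width, block):
            channels = []
            for di in range(block):
                for dj in range(block):
                    channels.extend(img[i + di][j + dj])
            out_row.append(channels)
        out.append(out_row)
    return out


def shape(img):
    """Return (height, width, channels) of a nested-list image."""
    return (len(img), len(img[0]), len(img[0][0]))


def make_image(height, width, channels, offset):
    """Build a toy image whose values encode their own position."""
    return [[[offset + (i * width + j) * channels + k
              for k in range(channels)]
             for j in range(width)]
            for i in range(height)]


obs = make_image(4, 4, 3, offset=0)      # toy 4 x 4 RGB observation
goal = make_image(4, 4, 3, offset=1000)  # toy 4 x 4 RGB goal image

stacked = channel_stack(obs, goal)
packed = space_to_depth(stacked, block=2)

print(shape(stacked))  # (4, 4, 6)
print(shape(packed))   # (2, 2, 24)
```

In this sketch both operations keep every pixel value and only change the layout: stacking places observation and goal channels at the same spatial location, and space-to-depth trades spatial resolution for channel depth.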