Image-goal navigation requires two distinct skills: first, core navigation skills, including detecting free space and obstacles and making decisions based on an internal representation; and second, computing directional information by comparing visual observations to the goal image. Current state-of-the-art methods rely either on dedicated image matching or on pre-training computer vision modules on relative pose estimation. In this paper, we study whether this task can be efficiently solved by end-to-end training of full agents with RL, as claimed by recent work. A positive answer would have impact beyond Embodied AI, as it would allow relative pose estimation to be trained from navigation reward alone. In this large experimental study, we investigate the effect of architectural choices such as late fusion, channel stacking, space-to-depth projections, and cross-attention, and their role in the emergence of relative pose estimators from navigation training. We show that the success of recent methods is influenced, to a certain extent, by simulator settings that lead to shortcuts in simulation. However, we also show that these capabilities transfer, to some extent, to more realistic settings. Finally, we find evidence of a correlation between navigation performance and probed (emerging) relative pose estimation performance, an important sub-skill.
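To make the contrast between the architectural choices concrete, the following is a minimal sketch, not the paper's actual architecture: channel stacking fuses observation and goal *before* encoding (so a single encoder can compare them pixel-wise), space-to-depth trades spatial resolution for channel depth, and late fusion encodes each image separately and combines only the embeddings. The toy mean-pool "encoder" is a hypothetical stand-in for a real visual backbone.

```python
import numpy as np

# Hypothetical 64x64 RGB observation and goal image, layout (H, W, C).
obs = np.random.rand(64, 64, 3)
goal = np.random.rand(64, 64, 3)

# Early fusion ("channel stacking"): concatenate along the channel axis,
# so one shared encoder sees both images and can align them spatially.
stacked = np.concatenate([obs, goal], axis=-1)  # shape (64, 64, 6)

def space_to_depth(x, block=2):
    """Fold each block x block spatial patch into the channel dimension."""
    h, w, c = x.shape
    x = x.reshape(h // block, block, w // block, block, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(
        h // block, w // block, block * block * c)

s2d = space_to_depth(stacked)  # shape (32, 32, 24)

def toy_encoder(x):
    """Stand-in encoder: global mean pool to a (C,) embedding."""
    return x.mean(axis=(0, 1))

# Late fusion: encode each image independently, then combine embeddings;
# any comparison between observation and goal happens only after encoding.
fused = np.concatenate([toy_encoder(obs), toy_encoder(goal)])  # shape (6,)
```

The design trade-off the study probes is visible even in this sketch: early fusion gives the encoder access to pixel-level correspondences between observation and goal, which relative pose estimation can exploit, while late fusion forces all comparison into the embedding space.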