Agents navigating in 3D environments require some form of memory, which should hold a compact and actionable representation of the history of observations that is useful for decision making and planning. In most end-to-end learning approaches the representation is latent and usually lacks a clearly defined interpretation, whereas classical robotics addresses this with scene reconstruction, which yields some form of map, usually estimated with geometry and sensor models and/or learning. In this work we propose to learn an actionable representation of the scene independently of the targeted downstream task and without explicitly optimizing reconstruction. The learned representation is optimized by a blind auxiliary agent trained to navigate with it on multiple short sub-episodes branching out from a waypoint and, most importantly, without any direct visual observation. We argue and show that the blindness property is important and forces the (trained) latent representation to be the only means for planning. With probing experiments we show that the learned representation optimizes navigability and not reconstruction. On downstream tasks we show that it is robust to changes in distribution, in particular the sim2real gap, which we evaluate with a real physical robot in a real office building, where it significantly improves performance.