Agents navigating in 3D environments require some form of memory that holds a compact and actionable representation of the observation history, useful for decision making and planning. In most end-to-end learning approaches this representation is latent and usually has no clearly defined interpretation, whereas classical robotics addresses the problem with scene reconstruction, producing some form of map that is usually estimated with geometry and sensor models and/or learning. In this work we propose to learn an actionable representation of the scene independently of the targeted downstream task and without explicitly optimizing reconstruction. The learned representation is optimized by a blind auxiliary agent trained to navigate with it on multiple short sub-episodes branching out from a waypoint and, most importantly, without any direct visual observation. We argue and show that this blindness property is important, as it forces the (trained) latent representation to be the only means for planning. With probing experiments we show that the learned representation optimizes navigability and not reconstruction. On downstream tasks we show that it is robust to changes in distribution, in particular the sim2real gap, which we evaluate with a real physical robot in a real office building, where it significantly improves performance.