Robust reinforcement learning agents operating on high-dimensional observations must identify relevant state features amidst many exogenous distractors. A representation that captures controllability identifies these state elements by determining what affects agent control. While methods such as inverse dynamics and mutual information capture controllability over a limited number of timesteps, capturing long-horizon controllable elements remains a challenging problem: myopic controllability can capture the moment right before an agent crashes into a wall, but not the control-relevance of the wall while the agent is still some distance away. To address this, we introduce action-bisimulation encoding, a method inspired by the bisimulation invariance pseudometric that extends single-step controllability with a recursive invariance constraint. In this way, action-bisimulation learns a multi-step controllability metric that smoothly discounts control-relevant state features by their temporal distance. We demonstrate that action-bisimulation pretraining on reward-free, uniformly random data improves sample efficiency in several environments, including the photorealistic 3D simulation domain Habitat. Additionally, we provide theoretical analysis and qualitative results illustrating the information captured by action-bisimulation.
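One plausible way to formalize the recursive invariance constraint, by direct analogy with the standard bisimulation pseudometric, is the following fixed-point definition (the symbols here are illustrative assumptions, not the paper's exact formulation):

```latex
% A sketch of a multi-step controllability pseudometric, assuming:
%   d_ss  : a single-step controllability distance, e.g. between
%           inverse-dynamics (or mutual-information) features of s_1, s_2
%   c     : a discount in [0, 1) that smoothly down-weights distant features
%   \pi_U : the uniform random policy used for reward-free pretraining
%   W_1(d): the 1-Wasserstein distance between next-state distributions,
%           measured under the metric d being defined recursively
\[
  d(s_1, s_2) \;=\; (1 - c)\, d_{ss}(s_1, s_2)
  \;+\; c \,\mathbb{E}_{a \sim \pi_U}\!\Big[ W_1(d)\big(P(\cdot \mid s_1, a),\, P(\cdot \mid s_2, a)\big) \Big]
\]
```

Under this reading, the first term anchors the metric to single-step (myopic) controllability, while the recursive second term propagates control-relevance backward through the dynamics, so a wall several steps away still separates states, just with weight discounted by powers of $c$.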