METRA: Scalable Unsupervised RL with Metric-Aware Abstraction

Unsupervised pre-training strategies have proven highly effective in natural language processing and computer vision. Likewise, unsupervised reinforcement learning (RL) holds the promise of discovering a variety of potentially useful behaviors that can accelerate the learning of a wide array of downstream tasks. Previous unsupervised RL approaches have mainly focused on pure exploration and mutual information skill learning. Despite these attempts, making unsupervised RL truly scalable remains a major open challenge: pure exploration approaches might struggle in complex environments with large state spaces, where covering every possible transition is infeasible, and mutual information skill learning approaches might fail to explore the environment at all because their objective provides no incentive to do so. To make unsupervised RL scalable to complex, high-dimensional environments, we propose a novel unsupervised RL objective, which we call Metric-Aware Abstraction (METRA). Our main idea is, instead of directly covering the entire state space, to cover only a compact latent space $Z$ that is metrically connected to the state space $S$ by temporal distances. By learning to move in every direction in the latent space, METRA obtains a tractable set of diverse behaviors that approximately cover the state space, and therefore scales to high-dimensional environments. Through experiments in five locomotion and manipulation environments, we demonstrate that METRA can discover a variety of useful behaviors even in complex, pixel-based environments, and that it is the first unsupervised RL method to discover diverse locomotion behaviors in pixel-based Quadruped and Humanoid. Our code and videos are available at https://seohong.me/projects/metra/
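To make the idea of "moving in every direction in the latent space" concrete, the following is a minimal sketch under stated assumptions, not the released implementation. The abstract does not spell out the objective, so the sketch assumes one plausible formalization: a skill is a unit direction in $Z$, the reward is the inner product between the latent displacement and that direction, and the temporal-distance link is approximated by requiring adjacent states to lie at most 1 apart in $Z$. The linear encoder `phi` and all function names are hypothetical and chosen only for illustration.

```python
import math
import random


def phi(state, weights):
    """Toy linear encoder from state space S to latent space Z (hypothetical)."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]


def sample_unit_skill(dim, rng):
    """Sample a skill as a random direction on the unit sphere in Z."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def directional_reward(state, next_state, skill, weights):
    """Assumed reward: latent displacement projected onto the skill direction."""
    z_now = phi(state, weights)
    z_next = phi(next_state, weights)
    return sum((b - a) * k for a, b, k in zip(z_now, z_next, skill))


def temporal_constraint_violation(state, next_state, weights):
    """Assumed constraint: adjacent states should be at most 1 apart in Z.

    Returns 0.0 when the constraint holds, otherwise the excess distance.
    """
    z_now = phi(state, weights)
    z_next = phi(next_state, weights)
    dist = math.sqrt(sum((b - a) ** 2 for a, b in zip(z_now, z_next)))
    return max(0.0, dist - 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical encoder: keep the first two of three state dimensions.
    weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    state, next_state = [0.0, 0.0, 5.0], [0.5, 0.0, 7.0]
    skill = sample_unit_skill(2, rng)
    print(directional_reward(state, next_state, skill, weights))
    print(temporal_constraint_violation(state, next_state, weights))
```

In this toy setup the third state dimension is ignored by the encoder, so a large change along it earns no reward; only displacement that the latent space registers counts. This illustrates, in a very reduced form, why covering a compact $Z$ can be more tractable than covering all of $S$. How the encoder is trained and how the constraint is enforced are left out here, since the abstract does not describe them.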
