Unsupervised pre-training has recently become the bedrock for computer vision and natural language processing. In reinforcement learning (RL), goal-conditioned RL can potentially provide an analogous self-supervised approach for making use of large quantities of unlabeled (reward-free) data. However, building effective algorithms for goal-conditioned RL that can learn directly from diverse offline data is challenging, because it is hard to accurately estimate the exact value function for faraway goals. Nonetheless, goal-reaching problems exhibit structure, such that reaching distant goals entails first passing through closer subgoals. This structure can be very useful, as assessing the quality of actions for nearby goals is typically easier than for more distant goals. Based on this idea, we propose a hierarchical algorithm for goal-conditioned RL from offline data. Using one action-free value function, we learn two policies that allow us to exploit this structure: a high-level policy that treats states as actions and predicts (a latent representation of) a subgoal and a low-level policy that predicts the action for reaching this subgoal. Through analysis and didactic examples, we show how this hierarchical decomposition makes our method robust to noise in the estimated value function. We then apply our method to offline goal-reaching benchmarks, showing that our method can solve long-horizon tasks that stymie prior methods, can scale to high-dimensional image observations, and can readily make use of action-free data. Our code is available at https://seohong.me/projects/hiql/
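The two-level decomposition described above can be sketched in a toy form: a high-level policy proposes a nearby subgoal (a state treated as an action), and a low-level policy outputs the primitive action toward that subgoal. This is a minimal illustrative sketch on a 1-D chain, not the paper's implementation; the names (`encode`, `high_policy`, `low_policy`) and the fixed subgoal horizon are assumptions for illustration.

```python
import numpy as np

def encode(state):
    # Stand-in for the learned latent subgoal representation phi(s);
    # identity here for the 1-D toy environment (illustrative assumption).
    return state

def high_policy(state, goal):
    # High-level policy: outputs a (latent) subgoal a few steps toward the
    # goal, rather than the distant goal itself. The 3-step horizon is an
    # arbitrary choice for this sketch.
    step = np.clip(goal - state, -3, 3)
    return encode(state + step)

def low_policy(state, subgoal):
    # Low-level policy: primitive action (-1, 0, or +1) toward the nearby
    # subgoal. Judging actions against a close subgoal is easier than
    # against a faraway goal, which is the structure the method exploits.
    return int(np.sign(subgoal - state))

def act(state, goal):
    return low_policy(state, high_policy(state, goal))

# Roll out on a 1-D chain: start at 0, reach the goal at 10.
state, goal = 0, 10
for _ in range(20):
    state += act(state, goal)
print(state)
```

Once at the goal, `high_policy` returns the current state as its own subgoal and `low_policy` outputs the null action, so the rollout remains at the goal.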