Offline Goal-Conditioned Reinforcement Learning (GCRL) is tasked with learning to achieve multiple goals in an environment purely from offline datasets using sparse reward functions. Offline GCRL is pivotal for developing generalist agents capable of leveraging pre-existing datasets to learn diverse and reusable skills without hand-engineering reward functions. However, contemporary approaches to GCRL based on supervised learning and contrastive learning are often suboptimal in the offline setting. An alternative perspective on GCRL optimizes for occupancy matching, but it requires learning a discriminator, which then serves as a pseudo-reward for downstream RL. Inaccuracies in the learned discriminator can cascade and degrade the resulting policy. We present a novel approach to GCRL under a new lens of mixture-distribution matching, leading to our discriminator-free method: SMORe. The key insight is to combine the occupancy matching perspective of GCRL with a convex dual formulation to derive a learning objective that can better leverage suboptimal offline data. SMORe learns scores, or unnormalized densities, representing the importance of taking an action at a state for reaching a particular goal. SMORe is principled, and our extensive experiments on the fully offline GCRL benchmark composed of robot manipulation and locomotion tasks, including tasks with high-dimensional observations, show that SMORe can outperform state-of-the-art baselines by a significant margin.
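For orientation, the occupancy matching view and its mixture-distribution variant can be sketched in generic notation. Every symbol below is an assumption introduced for illustration and is not taken from the abstract: d^π is the discounted goal-conditioned occupancy of a policy π, q is a target distribution over goal-reaching transitions, ρ is the offline data distribution, D_f is an f-divergence, and β is a mixing weight. This is a sketch of the general idea, not necessarily the exact objective SMORe optimizes.

```latex
% Illustrative notation only; all symbols are assumptions, not from the abstract.
% d^{\pi}(s,a;g): discounted goal-conditioned state-action occupancy of policy \pi
% q(s,a;g):       target distribution concentrated on transitions that reach goal g
% \rho(s,a;g):    distribution of the offline dataset
% D_f:            an f-divergence;  \beta \in (0,1]: a mixing weight

d^{\pi}(s,a;g) = (1-\gamma)\,\pi(a \mid s,g) \sum_{t=0}^{\infty} \gamma^{t}\, \Pr(s_t = s \mid \pi, g)

% Plain occupancy matching:
\min_{\pi}\; D_f\!\left( d^{\pi}(s,a;g) \,\|\, q(s,a;g) \right)

% Mixture-distribution matching, with \mathrm{Mix}_{\beta}(p_1,p_2) = \beta\,p_1 + (1-\beta)\,p_2:
\min_{\pi}\; D_f\!\left( \mathrm{Mix}_{\beta}(d^{\pi},\rho) \,\|\, \mathrm{Mix}_{\beta}(q,\rho) \right)
```

Under these assumptions, both arguments of the mixture objective share the same ρ component, so for an f that is strictly convex at 1 the divergence is zero exactly when d^π = q. The mixture form therefore keeps the same optimum as plain occupancy matching while expressing the objective in terms that include the offline data distribution.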