Offline Goal-Conditioned Reinforcement Learning (GCRL) is the problem of learning to achieve multiple goals in an environment purely from offline datasets, using only sparse reward functions. Offline GCRL is pivotal for developing generalist agents that can leverage pre-existing datasets to learn diverse, reusable skills without hand-engineered reward functions. However, contemporary approaches to GCRL based on supervised learning and contrastive learning are often suboptimal in the offline setting. An alternative perspective casts GCRL as occupancy matching, but it requires learning a discriminator that then serves as a pseudo-reward for downstream RL. Inaccuracies in the learned discriminator can cascade and degrade the resulting policy. We present a novel approach to GCRL under a new lens of mixture-distribution matching, leading to our discriminator-free method: SMORe. The key insight is to combine the occupancy matching perspective of GCRL with a convex dual formulation, yielding a learning objective that can better leverage suboptimal offline data. SMORe learns scores, or unnormalized densities, that represent the importance of taking an action at a state for reaching a particular goal. SMORe is principled, and our extensive experiments on the fully offline GCRL benchmark composed of robot manipulation and locomotion tasks, including tasks with high-dimensional observations, show that SMORe can outperform state-of-the-art baselines by a significant margin.