Learning rich skills under the option framework without supervision from external rewards is at the frontier of reinforcement learning research. Existing works mainly fall into two distinct categories: variational option discovery, which maximizes the diversity of the options through a mutual information loss (while ignoring coverage), and Laplacian-based methods, which focus on improving the coverage of options by increasing the connectivity of the state space (while ignoring diversity). In this paper, we show that diversity and coverage in unsupervised option discovery can indeed be unified under the same mathematical framework. Specifically, we explicitly quantify the diversity and coverage of the learned options through a novel use of the Determinantal Point Process (DPP) and optimize these objectives to discover options with both superior diversity and coverage. Our proposed algorithm, ODPP, is evaluated extensively on challenging tasks created with MuJoCo and Atari. The results demonstrate that our algorithm outperforms state-of-the-art baselines from both the diversity-driven and coverage-driven categories.
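To make the role of the DPP concrete, the following is a minimal sketch of the general property it relies on: under a DPP, a set of items is scored by the determinant of its similarity kernel, which shrinks as the items become more alike. This is not the ODPP objective itself. The function name `dpp_log_det_diversity`, the cosine-similarity kernel, the `eps` jitter, and the toy feature vectors are all assumptions made for illustration.

```python
import numpy as np


def dpp_log_det_diversity(features, eps=1e-6):
    """Log-determinant of a cosine-similarity kernel over the given items.

    Hypothetical helper: each row of `features` stands in for an embedding of
    one option (or trajectory). A larger value means the rows are less similar.
    """
    features = np.asarray(features, dtype=float)
    # Normalise rows so that the kernel entries are cosine similarities.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    kernel = unit @ unit.T
    # Small diagonal jitter keeps the determinant positive for near-duplicates.
    kernel = kernel + eps * np.eye(len(unit))
    _, logdet = np.linalg.slogdet(kernel)
    return logdet


# Two orthogonal embeddings versus two nearly parallel ones (toy data).
diverse = np.array([[1.0, 0.0], [0.0, 1.0]])
redundant = np.array([[1.0, 0.0], [0.99, 0.14]])

print(dpp_log_det_diversity(diverse) > dpp_log_det_diversity(redundant))
```

The orthogonal pair scores higher than the nearly parallel pair, which is the sense in which a determinant-based score rewards diversity.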