There has been significant recent progress in the area of unsupervised skill discovery, utilizing various information-theoretic objectives as measures of diversity. Despite these advances, challenges remain: current methods require significant online interaction, fail to leverage vast amounts of available task-agnostic data and typically lack a quantitative measure of skill utility. We address these challenges by proposing a principled offline algorithm for unsupervised skill discovery that, in addition to maximizing diversity, ensures that each learned skill imitates state-only expert demonstrations to a certain degree. Our main analytical contribution is to connect Fenchel duality, reinforcement learning, and unsupervised skill discovery to maximize a mutual information objective subject to KL-divergence state occupancy constraints. Furthermore, we demonstrate the effectiveness of our method on the standard offline benchmark D4RL and on a custom offline dataset collected from a 12-DoF quadruped robot for which the policies trained in simulation transfer well to the real robotic system.
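As a rough illustration of the constrained objective described above, the problem can be sketched as follows. The notation is an assumption introduced here for illustration and does not come from the abstract: $Z$ is a latent skill variable with values $z$, $S$ is the state, $d_{\pi_z}$ is the state occupancy induced by the skill-conditioned policy, $d_E$ is the state occupancy of the expert demonstrations, $\epsilon \ge 0$ is a tolerance that sets how closely each skill must imitate the expert, and $\lambda_z \ge 0$ are Lagrange multipliers. The direction of the KL divergence is likewise an assumption.

```latex
% Sketch of a mutual-information objective with per-skill
% KL state-occupancy constraints (notation assumed, not from the abstract).
\begin{align}
  \max_{\pi}\quad & I(S; Z) \\
  \text{s.t.}\quad & D_{\mathrm{KL}}\!\left( d_{\pi_z}(S) \,\|\, d_E(S) \right) \le \epsilon
  \qquad \forall z .
\end{align}

% One standard way to handle the constraints is a Lagrangian relaxation:
\begin{equation}
  \max_{\pi}\; \min_{\lambda \ge 0}\;
  I(S; Z) \;-\; \sum_{z} \lambda_z
  \left( D_{\mathrm{KL}}\!\left( d_{\pi_z}(S) \,\|\, d_E(S) \right) - \epsilon \right).
\end{equation}
```

Under this reading, a small $\epsilon$ forces every skill to stay close to the expert's state distribution, while a larger $\epsilon$ leaves more room for diversity. Fenchel duality would plausibly be what makes the KL terms estimable from offline data alone, though the abstract does not spell out that step.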