In reinforcement learning, unsupervised skill discovery aims to learn diverse skills without extrinsic rewards. Previous methods discover skills by maximizing the mutual information (MI) between states and skills. However, such an MI objective tends to learn simple, static skills and may hinder exploration. In this paper, we propose a novel unsupervised skill discovery method based on contrastive learning among behaviors, which makes the agent produce similar behaviors for the same skill and diverse behaviors for different skills. Under mild assumptions, our objective maximizes the MI between different behaviors generated by the same skill, which serves as an upper bound of the previous MI objective. Meanwhile, our method implicitly increases the state entropy to obtain better state coverage. We evaluate our method on challenging mazes and continuous control tasks. The results show that our method generates diverse and far-reaching skills and achieves competitive performance in downstream tasks compared with state-of-the-art methods.
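The idea of contrasting behaviors can be illustrated with a generic InfoNCE-style loss. The following is a minimal sketch under stated assumptions, not the objective as implemented in this work: the function name `info_nce_loss`, the use of cosine similarity, the `temperature` value, and the way positives are paired are all illustrative choices. Row `i` of `anchors` and `positives` is assumed to hold embeddings of states drawn from two different trajectories generated by the same skill `skill_ids[i]`; rows with other skills act as negatives.

```python
import numpy as np


def info_nce_loss(anchors, positives, skill_ids, temperature=0.5):
    """Generic InfoNCE-style contrastive loss over behavior embeddings.

    anchors, positives: (N, D) arrays of state embeddings (hypothetical
        encoder outputs); row i of each comes from the same skill.
    skill_ids: (N,) integer array giving the skill of each row.
    temperature: illustrative value, not taken from the source.
    """
    anchors = np.asarray(anchors, dtype=float)
    positives = np.asarray(positives, dtype=float)
    skill_ids = np.asarray(skill_ids)

    # Normalize rows so the dot product is a cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature  # shape (N, N)

    # Off-diagonal pairs that share a skill are not true negatives,
    # so they are excluded from the softmax denominator.
    n = logits.shape[0]
    same_skill = skill_ids[:, None] == skill_ids[None, :]
    exclude = same_skill & ~np.eye(n, dtype=bool)
    logits = np.where(exclude, -np.inf, logits)

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_prob)))
```

Minimizing a loss of this form pushes embeddings from the same skill together and embeddings from different skills apart, which matches the qualitative behavior described above; the exact reward and sampling scheme used by the method are not specified in this abstract.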