Mutual information-based reinforcement learning (RL) has been proposed as a promising framework for acquiring complex skills autonomously, without a task-oriented reward function, through mutual information (MI) maximization or variational empowerment. However, learning complex skills remains challenging because the order in which skills are trained can strongly affect sample efficiency. Motivated by this, we recast variational empowerment as curriculum learning in goal-conditioned RL with an intrinsic reward function, which we name Variational Curriculum RL (VCRL). From this perspective, we propose a novel information-theoretic approach to unsupervised skill discovery, called Value Uncertainty Variational Curriculum (VUVC). We prove that, under regularity conditions, VUVC accelerates the increase in entropy of the visited states compared to the uniform curriculum. We validate the effectiveness of our approach on complex navigation and robotic manipulation tasks in terms of sample efficiency and state coverage speed. We also demonstrate that the skills discovered by our method successfully complete a real-world robot navigation task in a zero-shot setup and that combining these skills with a global planner further improves performance.
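The abstract does not spell out how the curriculum is computed, so the following is only a generic sketch of the idea of an uncertainty-weighted goal curriculum, not the authors' algorithm. It assumes a hypothetical ensemble of value estimates per candidate goal and samples goals in proportion to the ensemble's disagreement, in contrast to a uniform curriculum. The function names (`uncertainty_weighted_goal_probs`, `sample_goal`, `empirical_entropy`) and the use of the standard deviation as the uncertainty measure are assumptions made for illustration.

```python
import math
import random
from statistics import pstdev


def uncertainty_weighted_goal_probs(value_estimates, eps=1e-8):
    """Return a goal-sampling distribution proportional to ensemble disagreement.

    value_estimates[k][i] is the (hypothetical) k-th ensemble member's value
    estimate for candidate goal i. Using the standard deviation as the
    uncertainty measure is an assumption of this sketch.
    """
    n_goals = len(value_estimates[0])
    weights = [
        pstdev([member[i] for member in value_estimates]) + eps
        for i in range(n_goals)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def sample_goal(candidate_goals, value_estimates, rng):
    """Draw one goal, favoring goals whose value estimates disagree most."""
    probs = uncertainty_weighted_goal_probs(value_estimates)
    return rng.choices(candidate_goals, weights=probs, k=1)[0]


def empirical_entropy(visit_counts):
    """Shannon entropy (in nats) of a discretized visited-state histogram."""
    total = sum(visit_counts)
    return -sum(
        (c / total) * math.log(c / total) for c in visit_counts if c > 0
    )


# Two ensemble members, three candidate goals: disagreement grows left to right.
estimates = [[0.0, 0.0, 0.0], [0.0, 1.0, 2.0]]
probs = uncertainty_weighted_goal_probs(estimates)
goal = sample_goal(["g0", "g1", "g2"], estimates, random.Random(0))
```

In this toy setup, the goal with the largest disagreement receives the highest sampling probability, while `empirical_entropy` gives one simple way to track state coverage over a discretized state space.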