This paper investigates two fundamental problems that arise when applying Intrinsic Motivation (IM) to reinforcement learning in Reward-Free Pre-Training (RFPT) tasks and Exploration with Intrinsic Motivation (EIM) tasks: 1) how to design an effective intrinsic objective in RFPT tasks, and 2) how to reduce the bias introduced by the intrinsic objective in EIM tasks. Existing IM methods suffer from static skills, limited state coverage, and sample inefficiency in RFPT tasks, and from suboptimality in EIM tasks. To tackle these problems, we propose \emph{Constrained Intrinsic Motivation (CIM)} for RFPT and EIM tasks, respectively: 1) CIM for RFPT maximizes a lower bound on the conditional state entropy subject to an alignment constraint on the state encoder network, enabling efficient discovery of dynamic and diverse skills and maximal state coverage; 2) CIM for EIM leverages constrained policy optimization to adaptively adjust the coefficient of the intrinsic objective, mitigating the distraction the intrinsic objective would otherwise introduce. Across various MuJoCo robotics environments, we empirically show that CIM for RFPT greatly surpasses fifteen IM methods for unsupervised skill discovery in terms of skill diversity, state coverage, and fine-tuning performance. Additionally, we showcase the effectiveness of CIM for EIM in redeeming intrinsic rewards when task rewards are exposed from the beginning. Our code is available at https://github.com/x-zheng16/CIM.
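To make the RFPT objective concrete, below is a minimal sketch of the two ingredients the abstract names: a particle-based lower bound on (conditional) state entropy, estimated from k-nearest-neighbor distances in an encoder's embedding space, and an alignment constraint tying embedded transition directions to the skill vector. The function names, the choice of k, and the cosine form of the constraint are illustrative assumptions, not the paper's exact estimator or constraint.

```python
import torch
import torch.nn.functional as F

def knn_entropy_reward(phi_s: torch.Tensor, phi_buf: torch.Tensor,
                       k: int = 12) -> torch.Tensor:
    """Particle-based entropy lower bound: intrinsic reward grows with the
    distance to the k-th nearest neighbor in embedding space, so states in
    sparsely visited regions receive larger rewards.

    phi_s:   (B, D) embeddings of the current batch of states
    phi_buf: (N, D) embeddings of states sampled from the replay buffer
    """
    d = torch.cdist(phi_s, phi_buf)                    # (B, N) pairwise distances
    kth = d.topk(k + 1, largest=False).values[:, -1]   # k-th NN, skipping self
    return torch.log(1.0 + kth)

def alignment_loss(delta_phi: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Assumed alignment constraint on the state encoder: the embedded
    transition direction phi(s') - phi(s) should point along the skill z,
    so distinct skills steer the agent toward distinct regions."""
    return (1.0 - F.cosine_similarity(delta_phi, z, dim=-1)).mean()
```

In such a setup, the encoder would be trained to keep `alignment_loss` small while the policy maximizes `knn_entropy_reward`, coupling skill diversity with coverage.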
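For the EIM side, a standard way to realize "constrained policy optimization that adaptively adjusts the coefficient of the intrinsic objective" is a Lagrangian dual update. The sketch below assumes a hypothetical constrained program, maximize extrinsic return subject to an exploration floor on the average intrinsic reward, and runs dual gradient steps on the multiplier; as the floor decays, the intrinsic coefficient shrinks toward zero, removing the intrinsic bias. This illustrates the general mechanism only, not the paper's exact constraint or update rule.

```python
import torch

class AdaptiveIntrinsicCoef:
    """Lagrange-multiplier treatment of the intrinsic-reward weight beta
    (all hyperparameters below are illustrative assumptions)."""

    def __init__(self, beta_init: float = 1.0, lr: float = 1e-2,
                 eps_init: float = 0.5, eps_decay: float = 0.999):
        self.beta = beta_init      # multiplier = intrinsic coefficient
        self.lr = lr               # dual step size
        self.eps = eps_init        # exploration floor, decays toward 0
        self.eps_decay = eps_decay

    def mix(self, r_ext: torch.Tensor, r_int: torch.Tensor) -> torch.Tensor:
        """Reward handed to the underlying policy-gradient learner:
        the per-step term of the Lagrangian."""
        return r_ext + self.beta * r_int

    def update(self, r_int_batch: torch.Tensor) -> None:
        """Projected dual step: beta rises while average intrinsic reward
        is below the floor, and decays toward 0 once exploration has
        exceeded it, so late training is driven by the task reward."""
        violation = self.eps - r_int_batch.mean().item()
        self.beta = max(0.0, self.beta + self.lr * violation)
        self.eps *= self.eps_decay
```

A typical loop would call `mix` when relabeling each batch's rewards and `update` once per batch, so the coefficient tracks the current exploration level rather than following a fixed schedule.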