We introduce a new maximum entropy reinforcement learning framework based on the distribution of states and actions visited by a policy. More precisely, an intrinsic reward function is added to the reward function of the Markov decision process to be controlled. For each state and action, this intrinsic reward is the relative entropy of the discounted distribution of states and actions (or features derived from them) visited during the subsequent time steps. We first prove that, under certain assumptions, an optimal exploration policy, one that maximizes the expected discounted sum of intrinsic rewards, also maximizes a lower bound on the state-action value function of the decision process. We also prove that the visitation distribution used in the intrinsic reward definition is the fixed point of a contraction operator. We then describe how existing algorithms can be adapted to learn this fixed point and compute the intrinsic rewards that enhance exploration. Finally, a new practical off-policy maximum entropy reinforcement learning algorithm is introduced. Empirically, exploration policies achieve good state-action space coverage, and high-performing control policies are computed efficiently.
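The core idea, an entropy-style intrinsic reward derived from the discounted state-visitation distribution of a policy, can be illustrated with a minimal tabular sketch. This is not the paper's algorithm: the function names, the toy MDP, the uniform policy, and the `-log d(s)` bonus (a simple entropy-motivated stand-in for the relative-entropy reward described above) are all illustrative assumptions.

```python
import numpy as np

def discounted_visitation(P, pi, s0, gamma=0.9):
    """Discounted state-visitation distribution under policy pi:
    d(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s | s_0 = s0)."""
    n = P.shape[0]
    # Transition matrix induced by pi: P_pi[s, s'] = sum_a pi[s, a] * P[s, a, s']
    P_pi = np.einsum('sa,sat->st', pi, P)
    mu = np.zeros(n)
    mu[s0] = 1.0
    # Closed form via the Neumann series: d = (1 - gamma) * mu (I - gamma P_pi)^{-1}
    return (1 - gamma) * mu @ np.linalg.inv(np.eye(n) - gamma * P_pi)

def intrinsic_bonus(d, eps=1e-8):
    """Entropy-motivated bonus: -log d(s) is large for rarely visited states."""
    return -np.log(d + eps)

# Toy 3-state, 2-action MDP (hypothetical transition kernel).
P = np.zeros((3, 2, 3))
P[0, 0] = [0.9, 0.1, 0.0]; P[0, 1] = [0.1, 0.8, 0.1]
P[1, 0] = [0.0, 0.9, 0.1]; P[1, 1] = [0.1, 0.1, 0.8]
P[2, 0] = [0.1, 0.0, 0.9]; P[2, 1] = [0.8, 0.1, 0.1]
pi = np.full((3, 2), 0.5)  # uniform policy

d = discounted_visitation(P, pi, s0=0)
bonus = intrinsic_bonus(d)

# Augment the extrinsic reward with the scaled intrinsic bonus.
r_ext = np.array([0.0, 0.0, 1.0])
beta = 0.1  # intrinsic-reward weight (assumed hyperparameter)
r_total = r_ext + beta * bonus
```

In the framework sketched above, this `d` would instead be learned as the fixed point of a contraction operator and the bonus recomputed per state-action pair; the closed-form matrix inverse here only works in small tabular settings.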