Maximum entropy reinforcement learning motivates agents to explore states and actions so as to maximize the entropy of some distribution, typically by providing an additional intrinsic reward proportional to the entropy of that distribution. In this paper, we study intrinsic rewards proportional to the entropy of the discounted distribution of state-action features visited during future time steps. This approach is motivated by two results. First, we show that the expected sum of these intrinsic rewards is a lower bound on the entropy of the discounted distribution of state-action features visited in trajectories starting from the initial states, a quantity we relate to an alternative maximum entropy objective. Second, we show that the distribution used in the intrinsic-reward definition is the fixed point of a contraction operator and can therefore be estimated off-policy. Experiments show that the new objective improves the visitation of features within individual trajectories, in exchange for a slightly reduced visitation of features in expectation over different trajectories, as the lower bound suggests. It also speeds up convergence when learning exploration-only agents, while control performance remains similar across most methods on the considered benchmarks.
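To make the two results concrete, the following is a hedged formalization; the symbols $\varphi$ (feature map), $\gamma$ (discount factor), $\rho_0$ (initial-state distribution), and $d^\pi$ are notational assumptions, since the abstract fixes no notation. Define the discounted distribution of state-action features visited after the pair $(s, a)$ under policy $\pi$ as
\[
d^\pi(z \mid s, a) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \Pr\nolimits^{\pi}\!\big(\varphi(s_t, a_t) = z \,\big|\, s_0 = s,\, a_0 = a\big).
\]
Splitting off the $t = 0$ term shows that $d^\pi$ satisfies the recursion
\[
d^\pi(z \mid s, a) = (1 - \gamma)\,\mathbf{1}\{\varphi(s, a) = z\} + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a),\, a' \sim \pi(\cdot \mid s')}\big[ d^\pi(z \mid s', a') \big],
\]
i.e., it is the unique fixed point of an affine operator that is a $\gamma$-contraction in the supremum norm, which is consistent with the second result and with temporal-difference-style, off-policy estimation. Writing the intrinsic reward as $r_{\mathrm{int}}(s, a) = \mathcal{H}\big(d^\pi(\cdot \mid s, a)\big)$, the distribution from the initial states is the mixture $d^\pi_{\rho_0} = \mathbb{E}_{s_0 \sim \rho_0,\, a_0 \sim \pi(\cdot \mid s_0)}\big[ d^\pi(\cdot \mid s_0, a_0) \big]$, so concavity of entropy (Jensen's inequality) yields
\[
\mathcal{H}\big(d^\pi_{\rho_0}\big) \;\geq\; \mathbb{E}_{s_0, a_0}\big[ \mathcal{H}\big(d^\pi(\cdot \mid s_0, a_0)\big) \big] \;=\; \mathbb{E}\big[ r_{\mathrm{int}}(s_0, a_0) \big].
\]
This one-step inequality only illustrates the Jensen-type argument; the paper's actual lower bound concerns the expected discounted sum of intrinsic rewards, with normalization constants the abstract does not specify.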
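Because the conditional distribution is the fixed point of a contraction operator and can therefore be estimated off-policy, a natural illustration is a tabular temporal-difference iteration on the recursion above. The sketch below is a minimal, hypothetical instantiation assuming finite state, action, and feature spaces; the names phi, td_update, and intrinsic_reward, and all constants, are invented for illustration and do not come from the paper.

    import numpy as np

    N_STATES, N_ACTIONS, N_FEATURES = 10, 2, 5   # hypothetical sizes
    GAMMA, ALPHA = 0.95, 0.1                     # discount factor and step size

    def phi(s, a):
        # Hypothetical state-action feature map: any function from
        # (state, action) pairs to one of N_FEATURES feature indices.
        return (s + a) % N_FEATURES

    # d[s, a] estimates the discounted distribution over features visited
    # during future time steps after taking action a in state s: the fixed
    # point of the contraction operator mentioned in the abstract.
    d = np.full((N_STATES, N_ACTIONS, N_FEATURES), 1.0 / N_FEATURES)

    def td_update(s, a, s_next, pi):
        # One temporal-difference step toward the operator's fixed point.
        # The transition (s, a, s_next) may come from any behaviour policy
        # (off-policy); the bootstrap term averages over the target policy
        # pi, given as an (N_STATES, N_ACTIONS) array of probabilities.
        onehot = np.zeros(N_FEATURES)
        onehot[phi(s, a)] = 1.0
        bootstrap = pi[s_next] @ d[s_next]       # E_{a' ~ pi}[ d[s', a'] ]
        target = (1.0 - GAMMA) * onehot + GAMMA * bootstrap
        d[s, a] += ALPHA * (target - d[s, a])

    def intrinsic_reward(s, a, eps=1e-12):
        # Entropy of the estimated future-feature distribution at (s, a),
        # used as the intrinsic reward studied in the paper.
        p = d[s, a]
        return float(-np.sum(p * np.log(p + eps)))

Because each target is a convex combination of probability vectors, the iterates remain valid distributions, and the $\gamma$-contraction property makes the expected update convergent; the paper itself presumably replaces the table with function approximation, but the abstract does not say how.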