Entropy regularisation is a widely adopted technique that enhances the performance and stability of policy optimisation. A notable form of entropy regularisation augments the objective with an entropy term, thereby simultaneously optimising the expected return and the entropy. This framework, known as maximum entropy reinforcement learning (MaxEnt RL), has shown theoretical and empirical successes. However, its practical application in straightforward on-policy actor-critic settings remains surprisingly underexplored. We hypothesise that this is due to the difficulty of managing the entropy reward in practice. This paper proposes a simple method for separating the entropy objective from the MaxEnt RL objective, which facilitates the implementation of MaxEnt RL in on-policy settings. Our empirical evaluations demonstrate that extending Proximal Policy Optimisation (PPO) and Trust Region Policy Optimisation (TRPO) within the MaxEnt framework improves policy optimisation performance in both MuJoCo and Procgen tasks. Additionally, our results highlight MaxEnt RL's capacity to enhance generalisation.
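For context, the MaxEnt RL objective referenced above is conventionally written as the expected return augmented with a policy-entropy bonus weighted by a temperature coefficient. The following is a minimal sketch of that standard objective (the notation $\alpha$, $\gamma$, and $\mathcal{H}$ is conventional and not taken from this abstract; the paper's contribution concerns how the entropy term is separated from this combined objective, not the objective itself):

$$
J_{\mathrm{MaxEnt}}(\pi)
  \;=\;
  \mathbb{E}_{\pi}\!\left[
    \sum_{t=0}^{\infty} \gamma^{t}
    \Bigl( r(s_t, a_t) \;+\; \alpha\, \mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr) \Bigr)
  \right],
$$

where $\mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr) = -\mathbb{E}_{a \sim \pi(\cdot \mid s_t)}\bigl[\log \pi(a \mid s_t)\bigr]$ is the policy entropy at state $s_t$ and $\alpha \ge 0$ is the temperature controlling the strength of the entropy reward.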