We show that discounted methods for solving continuing reinforcement learning problems can perform significantly better if they center their rewards by subtracting out the rewards' empirical average. The improvement is substantial at commonly used discount factors and increases further as the discount factor approaches one. In addition, we show that if a problem's rewards are shifted by a constant, then standard methods perform much worse, whereas methods with reward centering are unaffected. Estimating the average reward is straightforward in the on-policy setting; we propose a slightly more sophisticated method for the off-policy setting. Reward centering is a general idea, so we expect almost every reinforcement learning algorithm to benefit from its addition.
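To make the idea concrete, here is a minimal sketch of reward centering in a tabular Q-learning loop on a continuing task. The environment interface (`env.reset()`, and `env.step(action)` returning a next state and a reward), the function name, and the hyperparameter values are hypothetical placeholders, and the average-reward update shown is the straightforward on-policy running average mentioned above, not the paper's more sophisticated off-policy method.

```python
import numpy as np

def q_learning_with_reward_centering(env, num_states, num_actions,
                                     steps=100_000, alpha=0.1, beta=0.01,
                                     gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning on a continuing task, with rewards centered
    by subtracting a running estimate of the average reward."""
    rng = np.random.default_rng(seed)
    q = np.zeros((num_states, num_actions))  # action-value estimates
    r_bar = 0.0                              # average-reward estimate

    s = env.reset()  # hypothetical interface: returns the initial state
    for _ in range(steps):
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            a = int(rng.integers(num_actions))
        else:
            a = int(np.argmax(q[s]))
        s_next, r = env.step(a)  # hypothetical interface: (next state, reward)

        # TD update using the centered reward (r - r_bar)
        delta = (r - r_bar) + gamma * np.max(q[s_next]) - q[s, a]
        q[s, a] += alpha * delta

        # simple running average of observed rewards: the straightforward
        # on-policy estimate; the off-policy setting needs a different update
        r_bar += beta * (r - r_bar)
        s = s_next
    return q, r_bar
```

Because the centering term `r_bar` is subtracted from every reward, shifting all of a problem's rewards by a constant leaves the centered TD errors unchanged, which is one way to see the robustness to reward shifts claimed above.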