Many classic Reinforcement Learning (RL) algorithms rely on a Bellman operator, which involves an expectation over next states and leads to the concept of bootstrapping. To introduce a form of pessimism, we propose to replace this expectation with an expectile. In practice, this simply amounts to replacing the $L_2$ loss of the critic with the more general expectile loss. Introducing pessimism in RL is desirable for various reasons, such as tackling the overestimation problem (for which classic solutions are double Q-learning or the twin-critic approach of TD3) or robust RL (where transitions are adversarial). We study both cases empirically. For the overestimation problem, we show that the proposed approach, ExpectRL, provides better results than a classic twin critic. On robust RL benchmarks involving changes of the environment, we show that our approach is more robust than classic RL algorithms. We also introduce a variant of ExpectRL combined with domain randomization that is competitive with state-of-the-art robust RL agents. Finally, we extend ExpectRL with a mechanism for automatically choosing the expectile value, that is, the degree of pessimism.
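The core substitution described above can be sketched as follows: a minimal, illustrative implementation of the expectile loss applied to critic TD errors (the function name and the use of NumPy are assumptions for illustration, not the paper's actual code). With $\tau = 0.5$ the loss reduces to the ordinary $L_2$ loss up to a constant factor; $\tau < 0.5$ down-weights positive TD errors, yielding a pessimistic (below-mean) value estimate.

```python
import numpy as np

def expectile_loss(td_error, tau):
    """Asymmetric squared loss on TD errors (illustrative sketch).

    Positive errors (target above prediction) are weighted by tau,
    negative errors by 1 - tau. tau = 0.5 recovers 0.5 * MSE;
    tau < 0.5 induces pessimism by penalizing underestimation less.
    """
    weight = np.where(td_error > 0, tau, 1.0 - tau)
    return np.mean(weight * td_error ** 2)
```

In an actor-critic agent, this function would replace the usual mean-squared TD loss of the critic, with `tau` controlling the degree of pessimism.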