This paper revisits the recently proposed reward centering algorithms, including simple reward centering (SRC) and value-based reward centering (VRC), and points out that SRC indeed performs reward centering, whereas VRC is essentially Bellman error centering (BEC). Based on BEC, we provide the centered fixpoint for tabular value functions, as well as the centered TD fixpoint for linear value function approximation. We design the on-policy CTD algorithm and the off-policy CTDC algorithm, and prove the convergence of both. Finally, we experimentally validate the stability of the proposed algorithms. Bellman error centering readily extends to a variety of reinforcement learning algorithms.
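To make the centering mechanism concrete, the following is a minimal sketch of a tabular centered TD update in the style of value-based reward centering, in which the TD (Bellman) error drives both the value estimate and the reward-rate estimate; the symbols $\bar r_t$, $\alpha_t$, and $\eta$ are our illustrative notation, not necessarily the paper's:
\begin{align*}
\delta_t &= R_{t+1} - \bar r_t + \gamma \hat v(S_{t+1}) - \hat v(S_t), \\
\hat v(S_t) &\leftarrow \hat v(S_t) + \alpha_t \, \delta_t, \\
\bar r_t &\leftarrow \bar r_t + \eta \, \alpha_t \, \delta_t.
\end{align*}
Under this sketch, the reward-rate estimate $\bar r_t$ is updated by the Bellman error $\delta_t$ rather than by the raw reward, which is the distinction between BEC (VRC) and SRC highlighted above.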