Experience replay is a key component in the success of online reinforcement learning (RL). Prioritized experience replay (PER) reweights experiences by the temporal difference (TD) error, empirically enhancing performance. However, few works have explored the motivation for using the TD error. In this work, we provide an alternative perspective on TD-error-based reweighting. We show a connection between experience prioritization and occupancy optimization. Using a regularized RL objective with an $f$-divergence regularizer and employing its dual form, we show that an optimal solution to the objective is obtained by shifting the distribution of off-policy data in the replay buffer towards the on-policy optimal distribution using TD-error-based occupancy ratios. Our derivation yields a new pipeline for TD-error prioritization. We specifically study the KL divergence as the regularizer and obtain a new prioritization scheme, regularized optimal experience replay (ROER). We evaluate the proposed scheme with the Soft Actor-Critic (SAC) algorithm on continuous-control MuJoCo and DM Control benchmark tasks, where it outperforms baselines in 6 out of 11 tasks while results on the remaining tasks match or stay close to the baselines. Further, with pretraining, ROER achieves noticeable improvement on the difficult AntMaze environment where baselines fail, demonstrating applicability to offline-to-online fine-tuning. Code is available at \url{https://github.com/XavierChanglingLi/Regularized-Optimal-Experience-Replay}.
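To illustrate the idea of shifting the replay-buffer distribution with TD-error-based occupancy ratios, here is a minimal sketch, assuming (as is typical for a KL-regularized dual) that the occupancy ratio takes an exponential form in the TD error, priority $\propto \exp(\delta / \alpha)$ with temperature $\alpha$. The function name, the temperature parameter, and the clipping are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def roer_style_priorities(td_errors, temperature=1.0, clip_range=(-10.0, 10.0)):
    """Hypothetical sketch: map TD errors to sampling priorities via an
    exponential (KL-dual) form, priority ~ exp(td / temperature).
    Clipping keeps the exponential numerically stable."""
    td = np.clip(np.asarray(td_errors, dtype=float), *clip_range)
    pri = np.exp(td / temperature)
    return pri / pri.sum()  # normalize into a sampling distribution

# Usage: sample a minibatch from the buffer according to these priorities.
rng = np.random.default_rng(0)
td_errors = rng.normal(size=8)          # stand-in TD errors for 8 transitions
probs = roer_style_priorities(td_errors, temperature=0.5)
batch_idx = rng.choice(len(td_errors), size=4, p=probs)
```

A smaller temperature concentrates sampling on high-TD-error transitions; a large temperature recovers near-uniform replay, matching the intuition that the regularizer strength controls how far the buffer distribution is shifted.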