Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate Exploration Bias

It is desirable for policies to optimistically explore new states and behaviors during online reinforcement learning (RL) or fine-tuning, especially when prior offline data does not provide enough state coverage. However, exploration bonuses can bias the learned policy, and our experiments find that naive yet standard use of such bonuses can fail to recover a performant policy. Concurrently, pessimistic training in offline RL has enabled recovery of performant policies from static datasets. Can we leverage offline RL to recover better policies from online interaction? We make the simple observation that a policy can be trained from scratch on all interaction data with pessimistic objectives, thereby decoupling the policies used for data collection and for evaluation. Specifically, we propose offline retraining, a policy extraction step at the end of online fine-tuning in our Offline-to-Online-to-Offline (OOO) framework for RL. An optimistic (exploration) policy is used to interact with the environment, and a separate pessimistic (exploitation) policy is trained on all the observed data for evaluation. Such decoupling can reduce bias from online interaction (intrinsic rewards, primacy bias) in the evaluation policy, and can allow more exploratory behaviors during online interaction, which in turn can generate better data for exploitation. OOO is complementary to several offline-to-online RL and online RL methods. In our fine-tuning experiments it improves their average performance by 14% to 26%, it achieves state-of-the-art performance on several environments in the D4RL benchmarks, and it improves online RL performance by 165% on two OpenAI Gym environments. Further, OOO can enable fine-tuning from incomplete offline datasets where prior methods can fail to recover a performant policy. Implementation: https://github.com/MaxSobolMark/OOO
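To make the decoupling concrete, the following is a minimal tabular sketch and not the authors' implementation (that lives in the linked repository). Everything in it is an illustrative assumption: the six-state chain environment, the count-based bonus standing in for an exploration bonus, the count-based penalty with in-support backups standing in for a pessimistic offline RL objective, and all names and constants (`explore_online`, `retrain_offline`, `beta`, `alpha`). The exploration learner sees task reward plus bonus, while the replay buffer stores task reward only, and the evaluation policy is fit from scratch on that buffer.

```python
from collections import defaultdict

N_STATES, GAMMA = 6, 0.9
ACTIONS = (+1, -1)  # move right / move left on a toy chain (hypothetical environment)

def step(state, action):
    """Deterministic chain: reward 1 for landing on the right end, else 0."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

def explore_online(episodes=20, horizon=20, beta=1.0, lr=0.5):
    """Optimistic phase: Q-learning on task reward plus a count-based bonus."""
    q, counts, buffer = defaultdict(float), defaultdict(int), []
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            # Optimism: rarely tried actions receive a larger bonus.
            a = max(ACTIONS, key=lambda b: q[s, b] + beta / (counts[s, b] + 1) ** 0.5)
            s2, r = step(s, a)
            counts[s, a] += 1
            buffer.append((s, a, r, s2))  # log the task reward only, never the bonus
            bonus = beta / counts[s, a] ** 0.5
            target = r + bonus + GAMMA * max(q[s2, b] for b in ACTIONS)
            q[s, a] += lr * (target - q[s, a])
            s = s2
    return q, buffer

def retrain_offline(buffer, alpha=0.05, sweeps=200):
    """Pessimistic phase: fit a fresh Q-function to the logged data only."""
    counts, model = defaultdict(int), {}
    for s, a, r, s2 in buffer:
        counts[s, a] += 1
        model[s, a] = (r, s2)  # valid because the toy dynamics are deterministic
    q = {sa: 0.0 for sa in model}  # from scratch: nothing reused from exploration
    for _ in range(sweeps):
        for (s, a), (r, s2) in model.items():
            # Pessimism: back up only through actions present in the data,
            # and subtract a penalty that shrinks as the pair is seen more often.
            next_v = max((q[s2, b] for b in ACTIONS if (s2, b) in q), default=0.0)
            q[s, a] = r - alpha / counts[s, a] ** 0.5 + GAMMA * next_v
    # Greedy evaluation policy over the actions that appear in the data.
    return {s: max((a for a in ACTIONS if (s, a) in q), key=lambda a: q[s, a])
            for s in {s for s, _ in q}}

_, data = explore_online()
evaluation_policy = retrain_offline(data)
```

In this sketch the exploration Q-function is discarded after data collection, so any bias introduced by the bonus cannot leak into `evaluation_policy`; only the logged transitions carry over.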
