Offline reinforcement learning (RL) is a learning paradigm in which an agent learns from a fixed dataset of experience. However, learning solely from a static dataset can limit performance because the agent cannot explore. To overcome this limitation, offline-to-online RL combines offline pre-training with online fine-tuning, which lets the agent further refine its policy by interacting with the environment in real time. Despite these benefits, existing offline-to-online RL methods suffer from performance degradation and slow improvement during the online phase. To tackle these challenges, we propose a novel framework called Ensemble-based Offline-to-Online (E2O) RL. By increasing the number of Q-networks, we seamlessly bridge offline pre-training and online fine-tuning without degrading performance. Moreover, to speed up online improvement, we appropriately loosen the pessimism of Q-value estimation and incorporate ensemble-based exploration mechanisms into our framework. Experimental results demonstrate that E2O substantially improves the training stability, learning efficiency, and final performance of existing offline RL methods during online fine-tuning on a range of locomotion and navigation tasks, significantly outperforming existing offline-to-online RL methods.
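The three ingredients named in the abstract (a Q-ensemble, loosened pessimism, and ensemble-based exploration) can be illustrated with a minimal NumPy sketch. This is not the E2O implementation: the function names, the random-subset minimum used to loosen pessimism, and the mean-plus-standard-deviation exploration bonus are all assumptions chosen for illustration, and the Q-values are plain arrays rather than network outputs.

```python
import numpy as np


def pessimistic_target(q_next, reward, discount=0.99):
    """Bellman target using the minimum over ALL ensemble members.

    q_next: array of shape (n_members, batch), next-state Q-values.
    reward: array of shape (batch,).
    """
    return reward + discount * q_next.min(axis=0)


def loosened_target(q_next, reward, rng, subset_size=2, discount=0.99):
    """Bellman target using the minimum over a random subset of members.

    Assumption for illustration: a subset minimum is one way to make the
    estimate less pessimistic; it is never below the full-ensemble minimum.
    """
    idx = rng.choice(q_next.shape[0], size=subset_size, replace=False)
    return reward + discount * q_next[idx].min(axis=0)


def ucb_action(q_candidates, bonus=1.0):
    """Pick the candidate action with the highest optimistic score.

    q_candidates: array of shape (n_members, n_candidates).
    The score is ensemble mean plus `bonus` times ensemble standard
    deviation, so actions the ensemble disagrees on are favored.
    """
    score = q_candidates.mean(axis=0) + bonus * q_candidates.std(axis=0)
    return int(score.argmax())
```

With more ensemble members, the full minimum becomes more conservative, which suits the offline phase; the subset minimum and the disagreement bonus relax that conservatism once online interaction is available.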