Offline reinforcement learning (RL) is a learning paradigm in which an agent learns from a fixed dataset of experience. However, learning solely from a static dataset can limit performance because the agent cannot explore. To overcome this limitation, offline-to-online RL combines offline pre-training with online fine-tuning, which enables the agent to further refine its policy by interacting with the environment in real time. Despite these benefits, existing offline-to-online RL methods suffer from performance degradation and slow improvement during the online phase. To tackle these challenges, we propose a novel framework called Ensemble-based Offline-to-Online (E2O) RL. By increasing the number of Q-networks, we seamlessly bridge offline pre-training and online fine-tuning without degrading performance. Moreover, to accelerate performance improvement during the online phase, we appropriately loosen the pessimism of Q-value estimation and incorporate ensemble-based exploration mechanisms into our framework. Experimental results demonstrate that E2O can substantially improve the training stability, learning efficiency, and final performance of existing offline RL methods during online fine-tuning on a range of locomotion and navigation tasks, significantly outperforming existing offline-to-online RL methods.
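The abstract does not say how the pessimism of the Q-value estimate is loosened or which ensemble-based exploration mechanism is used. The sketch below illustrates the general idea under assumed choices only: a minimum over the ensemble as the pessimistic estimate, a mean-minus-scaled-standard-deviation estimate as a looser alternative, and an upper-confidence-bound rule for exploration. The function names and the `beta` and `bonus` parameters are hypothetical and are not taken from E2O.

```python
import statistics
from typing import Sequence


def pessimistic_target(q_values: Sequence[float]) -> float:
    """Most pessimistic ensemble estimate: the minimum over all Q-networks."""
    return min(q_values)


def loosened_target(q_values: Sequence[float], beta: float = 0.5) -> float:
    """Assumed looser estimate: ensemble mean minus beta times ensemble std.

    beta is a hypothetical knob; a larger beta gives a more pessimistic value.
    """
    mean = statistics.fmean(q_values)
    std = statistics.pstdev(q_values)  # population std across the ensemble
    return mean - beta * std


def ucb_action(q_per_action: Sequence[Sequence[float]], bonus: float = 1.0) -> int:
    """Assumed exploration rule: pick the candidate action whose ensemble
    mean plus bonus times ensemble std is highest (ties go to the lowest index).
    """
    scores = [
        statistics.fmean(q) + bonus * statistics.pstdev(q)
        for q in q_per_action
    ]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Five Q-networks evaluating the same state-action pair.
    ensemble_q = [1.0, 2.0, 3.0, 4.0, 5.0]
    print("pessimistic:", pessimistic_target(ensemble_q))
    print("loosened:", loosened_target(ensemble_q, beta=0.5))

    # Two candidate actions, each scored by three Q-networks.
    candidates = [[2.0, 2.0, 2.0], [1.0, 2.0, 3.0]]
    print("explore action:", ucb_action(candidates, bonus=1.0))
```

In this toy example both candidate actions have the same ensemble mean, so the exploration rule prefers the one on which the Q-networks disagree more, which is the usual intuition behind uncertainty-driven exploration with ensembles.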