The offline-to-online (O2O) paradigm in reinforcement learning (RL) fine-tunes models pre-trained on offline datasets through subsequent online interaction. However, conventional O2O RL algorithms typically require retaining and retraining on large offline datasets to mitigate the effects of out-of-distribution (OOD) data, which limits their efficiency in exploiting online samples. To address this challenge, we introduce a new paradigm called SAMG: State-Action-Conditional Offline-to-Online Reinforcement Learning with Offline Model Guidance. Rather than training directly on offline data, SAMG freezes the pre-trained offline critic and uses it to provide an offline value for each state-action pair, delivering compact offline information. By freezing and leveraging these values, the framework eliminates the need to retrain on offline data. The offline values are then combined with the online target critic through a Bellman equation weighted by a policy state-action-aware coefficient. This coefficient, derived from a conditional variational auto-encoder (C-VAE), captures the reliability of the offline data at the state-action level. SAMG can be easily integrated with existing Q-function-based O2O RL algorithms. Theoretical analysis shows that SAMG enjoys good optimality and lower estimation error. Empirical evaluations demonstrate that SAMG outperforms four state-of-the-art O2O RL algorithms on the D4RL benchmark.
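To make the coefficient-weighted Bellman update concrete, the following is a minimal sketch of one plausible form of the target computation. All function names are hypothetical stand-ins (the paper's critics, C-VAE coefficient, and exact weighting are not specified here); it assumes the target is a convex combination of the frozen offline value and the standard online TD target, weighted by a state-action-aware coefficient in [0, 1].

```python
import numpy as np

# Toy stand-ins for the two critics; hypothetical, not the paper's networks.
def q_offline(s, a):
    """Frozen pre-trained offline critic (never updated during fine-tuning)."""
    return np.tanh(s + a)

def q_online_target(s, a):
    """Online target critic, updated during online fine-tuning."""
    return np.tanh(0.5 * s + a)

def coeff(s, a):
    """Stand-in for the C-VAE-derived state-action-aware coefficient in [0, 1];
    higher means the offline critic is trusted more for this (s, a)."""
    return 1.0 / (1.0 + np.exp(-(s - a)))

def samg_target(s, a, r, s_next, a_next, gamma=0.99):
    """Assumed form of the weighted Bellman target: a convex combination of
    the frozen offline value and the usual online TD target."""
    w = coeff(s, a)
    td_target = r + gamma * q_online_target(s_next, a_next)
    return w * q_offline(s, a) + (1.0 - w) * td_target
```

Because the weight is a per-(s, a) convex coefficient, the target interpolates smoothly between trusting the offline model on well-covered state-action pairs and falling back to the ordinary online TD target on out-of-distribution ones.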