Offline-to-online reinforcement learning (RL) is a training paradigm that combines pre-training on a pre-collected dataset with fine-tuning in an online environment. However, online fine-tuning can intensify the well-known distributional shift problem. Existing solutions tackle this problem by imposing a policy constraint on the policy improvement objective in both offline and online learning. These methods typically apply a single balance between policy improvement and constraint to all collected data. This one-size-fits-all approach may not make the best use of each collected sample, because data quality varies significantly across states. To address this, we introduce Family Offline-to-Online RL (FamO2O), a simple yet effective framework that enables existing algorithms to determine state-adaptive improvement-constraint balances. FamO2O uses a universal model to train a family of policies with different improvement/constraint intensities, and a balance model to select a suitable policy for each state. Theoretically, we prove that state-adaptive balances are necessary for achieving a higher policy performance upper bound. Empirically, extensive experiments show that FamO2O offers a statistically significant improvement over various existing methods, achieving state-of-the-art performance on the D4RL benchmark. Code is available at https://github.com/LeapLabTHU/FamO2O.
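The two-model structure described in the abstract can be illustrated with a minimal sketch. It is not the released implementation: the linear models, the dimensions, the balance range, and every name (`BalanceModel`, `UniversalPolicy`, `select_action`, `BALANCE_MIN`, `BALANCE_MAX`) are assumptions made for illustration, and training is omitted entirely. The sketch only shows how a balance model could pick a per-state coefficient that then indexes one member of a policy family represented by a single set of weights.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, ACTION_DIM = 4, 2
# Hypothetical range of improvement/constraint balance coefficients.
BALANCE_MIN, BALANCE_MAX = 0.1, 10.0


class BalanceModel:
    """Maps a state to a scalar balance coefficient (assumed linear model)."""

    def __init__(self):
        self.w = rng.normal(scale=0.1, size=STATE_DIM)

    def __call__(self, state):
        # Squash to (0, 1) with a sigmoid, then rescale to the allowed range.
        gate = 1.0 / (1.0 + np.exp(-(state @ self.w)))
        return BALANCE_MIN + (BALANCE_MAX - BALANCE_MIN) * gate


class UniversalPolicy:
    """One set of weights standing in for a family of policies,
    indexed by the balance coefficient (assumed conditioning scheme)."""

    def __init__(self):
        self.w = rng.normal(scale=0.1, size=(STATE_DIM + 1, ACTION_DIM))

    def __call__(self, state, balance):
        # The balance coefficient is appended to the state as an extra input.
        x = np.append(state, balance)
        return np.tanh(x @ self.w)


def select_action(state, policy, balance_model):
    """Choose a state-dependent balance, then act with the matching policy."""
    balance = balance_model(state)
    return policy(state, balance), balance


policy = UniversalPolicy()
balance_model = BalanceModel()
state = rng.normal(size=STATE_DIM)
action, balance = select_action(state, policy, balance_model)
```

Conditioning a single network on the coefficient, rather than keeping a separate network per coefficient, is one way to represent a continuous family of policies; whether the paper uses exactly this form is not stated in the abstract.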