Offline-to-online reinforcement learning (RL) combines offline pretraining with online finetuning and promises better sample efficiency and policy performance. However, existing methods, although effective, still suffer from suboptimal performance, limited adaptability, and poor computational efficiency. We propose PROTO, a framework that overcomes these limitations by augmenting the standard RL objective with an iteratively evolving regularization term. PROTO performs a trust-region-style update: it gradually relaxes the constraint strength as the regularization term evolves, which yields stable initial finetuning and optimal final performance. With only a few lines of code changed, PROTO can bridge any offline policy pretraining method and any standard off-policy RL finetuning method to form an offline-to-online RL pipeline, making it adaptable to diverse methods. PROTO adds minimal computational overhead and enables highly efficient online finetuning. Extensive experiments demonstrate that PROTO outperforms state-of-the-art (SOTA) baselines, offering an adaptable and efficient offline-to-online RL framework.
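To make the idea of an iteratively evolving regularization term concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single state with a small discrete action set, a KL-divergence regularizer toward the previous policy iterate, and a linearly decaying coefficient `alpha`; the abstract specifies none of these choices, and all function names (`kl_regularized_step`, `linear_decay`, `finetune`) are hypothetical.

```python
import math


def kl_regularized_step(q_values, pi_ref, alpha):
    """Closed-form maximizer of sum_a pi(a) * Q(a) - alpha * KL(pi || pi_ref).

    For discrete actions the maximizer is pi(a) proportional to
    pi_ref(a) * exp(Q(a) / alpha). (Assumed regularizer: KL divergence.)
    """
    # Subtract the max Q-value before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [p * math.exp((q - q_max) / alpha) for q, p in zip(q_values, pi_ref)]
    total = sum(weights)
    return [w / total for w in weights]


def linear_decay(alpha_start, alpha_end, step, num_steps):
    """Assumed schedule: move linearly from alpha_start to alpha_end."""
    frac = min(step / max(num_steps - 1, 1), 1.0)
    return alpha_start + frac * (alpha_end - alpha_start)


def finetune(q_values, pi_pretrained, alpha_start=5.0, alpha_end=0.1, num_steps=20):
    """Iterate the regularized update, starting from the pretrained policy.

    The reference policy at each step is the previous iterate (the
    "evolving" part), and alpha shrinks so the constraint is relaxed.
    """
    pi = list(pi_pretrained)
    history = [pi]
    for k in range(num_steps):
        alpha = linear_decay(alpha_start, alpha_end, k, num_steps)
        pi = kl_regularized_step(q_values, pi, alpha)
        history.append(pi)
    return history


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0]       # toy Q-values; the last action is best
    pi0 = [0.8, 0.15, 0.05]   # toy "pretrained" policy that prefers the worst action
    history = finetune(q, pi0)
    print("after first update:", history[1])
    print("after last update: ", history[-1])
```

In this toy setting the first update stays close to the pretrained policy because `alpha` is large, while later updates shift probability mass toward the highest-value action as `alpha` decays. A practical offline-to-online method would instead learn the Q-function and policy from data with function approximation; this sketch only illustrates the shape of the update.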