Combining offline and online reinforcement learning (RL) is crucial for efficient and safe learning. However, previous approaches treat offline and online learning as separate procedures, resulting in redundant designs and limited performance. We ask: can we achieve straightforward yet effective offline and online learning without introducing extra conservatism or regularization? In this study, we propose Uni-o4, which uses an on-policy objective for both offline and online learning. Owing to the alignment of objectives across the two phases, the RL agent can transfer between offline and online learning seamlessly. This property enhances the flexibility of the learning paradigm, allowing arbitrary combinations of pretraining, fine-tuning, offline, and online learning. In the offline phase specifically, Uni-o4 leverages diverse ensemble policies to address the mismatch between the estimated behavior policy and the offline dataset. Through a simple offline policy evaluation (OPE) approach, Uni-o4 can achieve multi-step policy improvement safely. We demonstrate that, with this method, the fusion of the two paradigms yields superior offline initialization as well as stable and rapid online fine-tuning. Through real-world robot tasks, we highlight the benefits of this paradigm for rapid deployment in challenging, previously unseen real-world environments. Additionally, through comprehensive evaluations on numerous simulated benchmarks, we substantiate that our method achieves state-of-the-art performance in both offline learning and offline-to-online fine-tuning. Our website: https://lei-kun.github.io/uni-o4/