Combining offline and online reinforcement learning (RL) is crucial for efficient and safe learning. However, previous approaches treat offline and online learning as separate procedures, resulting in redundant designs and limited performance. We ask: can we achieve straightforward yet effective offline and online learning without introducing extra conservatism or regularization? In this study, we propose Uni-o4, which uses an on-policy objective for both offline and online learning. Owing to the alignment of objectives across the two phases, the RL agent can transfer between offline and online learning seamlessly. This property enhances the flexibility of the learning paradigm, allowing arbitrary combinations of pretraining, fine-tuning, offline, and online learning. Specifically, in the offline phase, Uni-o4 leverages diverse ensemble policies to address the mismatch between the estimated behavior policy and the offline dataset. Through a simple offline policy evaluation (OPE) approach, Uni-o4 can achieve multi-step policy improvement safely. We demonstrate that, with this approach, the fusion of the two paradigms yields superior offline initialization as well as stable and rapid online fine-tuning. Through real-world robot tasks, we highlight the benefits of this paradigm for rapid deployment in challenging, previously unseen real-world environments. Additionally, through comprehensive evaluations on numerous simulated benchmarks, we substantiate that our method achieves state-of-the-art performance in both offline learning and offline-to-online fine-tuning. Project website: https://lei-kun.github.io/uni-o4/
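
To illustrate what sharing one on-policy objective across both phases could look like, the following is a minimal sketch, not the Uni-o4 implementation. It assumes the objective is a PPO-style clipped surrogate, which the abstract does not state; the function name `clipped_surrogate`, its parameters, and the `clip_eps` default are hypothetical. Under this assumption, only the source of the inputs would differ between phases: offline, the old log-probabilities would come from the estimated behavior policy evaluated on dataset transitions; online, they would come from the policy that collected the rollout.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO-style clipped surrogate objective (to be maximized).

    logp_new:   log-probabilities of the taken actions under the policy being updated
    logp_old:   log-probabilities under the reference policy (assumed: the estimated
                behavior policy offline, the rollout policy online)
    advantages: advantage estimates for the same state-action pairs
    clip_eps:   clipping range for the probability ratio (hypothetical default)
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: take the smaller of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

Because the function depends only on log-probabilities and advantages, the same call could in principle serve both an offline update and an online update, which is the kind of objective alignment the abstract describes.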