Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with Multi-Step On-Policy Optimization

Combining offline and online reinforcement learning (RL) is crucial for efficient and safe learning. However, previous approaches treat offline and online learning as separate procedures, resulting in redundant designs and limited performance. We ask: can we achieve straightforward yet effective offline and online learning without introducing extra conservatism or regularization? In this study, we propose Uni-o4, which uses an on-policy objective for both offline and online learning. Owing to the alignment of objectives across the two phases, the RL agent can transfer between offline and online learning seamlessly. This property makes the learning paradigm more flexible, allowing arbitrary combinations of pretraining, fine-tuning, offline, and online learning. In the offline phase specifically, Uni-o4 leverages diverse ensemble policies to address the mismatch between the estimated behavior policy and the offline dataset. Through a simple offline policy evaluation (OPE) approach, Uni-o4 can achieve multi-step policy improvement safely. We demonstrate that, with this method, the fusion of the two paradigms yields superior offline initialization as well as stable and rapid online fine-tuning. Through real-world robot tasks, we highlight the benefits of this paradigm for rapid deployment in challenging, previously unseen real-world environments. Additionally, through comprehensive evaluations on numerous simulated benchmarks, we substantiate that our method achieves state-of-the-art performance in both offline learning and offline-to-online fine-tuning. Our website: https://lei-kun.github.io/uni-o4/ .
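
To make "an on-policy objective for both offline and online learning" concrete, the following is a minimal sketch of a PPO-style clipped surrogate. The abstract does not name the exact objective, so the clipped form, the function name `clipped_surrogate`, and the `clip_eps` default are assumptions made for illustration, not the authors' implementation.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate over a batch of (state, action) samples.

    logp_new:   log-probabilities of the actions under the policy being optimized
    logp_old:   log-probabilities under the reference policy (assumed here to be
                the estimated behavior policy offline, or the data-collecting
                policy online)
    advantages: advantage estimates for the same samples
    clip_eps:   hypothetical clipping range, not taken from the abstract
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Importance ratio between the new policy and the reference policy.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - clip_eps, 1 + clip_eps].
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic (lower) bound of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

Under these assumptions, the same function could be maximized in both phases; only the source of `logp_old` and `advantages` would change (a fixed dataset offline, fresh rollouts online), which is one way to read the claim that the two phases share an objective.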
