Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with Multi-Step On-Policy Optimization

Combining offline and online reinforcement learning (RL) is crucial for efficient and safe learning. However, previous approaches treat offline and online learning as separate procedures, resulting in redundant designs and limited performance. We ask: can we achieve straightforward yet effective offline and online learning without introducing extra conservatism or regularization? In this study, we propose Uni-o4, which uses an on-policy objective for both offline and online learning. Owing to the alignment of objectives across the two phases, the RL agent can transfer between offline and online learning seamlessly. This property makes the learning paradigm more flexible, allowing arbitrary combinations of pretraining, fine-tuning, offline, and online learning. In the offline phase specifically, Uni-o4 leverages an ensemble of diverse policies to address the mismatch between the estimated behavior policy and the offline dataset. Through a simple offline policy evaluation (OPE) approach, Uni-o4 can achieve multi-step policy improvement safely. We demonstrate that, with this method, fusing the two paradigms yields superior offline initialization as well as stable and rapid online fine-tuning. Through real-world robot tasks, we highlight the benefits of this paradigm for rapid deployment in challenging, previously unseen real-world environments. Additionally, through comprehensive evaluations on numerous simulated benchmarks, we substantiate that our method achieves state-of-the-art performance in both offline learning and offline-to-online fine-tuning. Project website: https://lei-kun.github.io/uni-o4/
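
The abstract does not spell out the on-policy objective, so the following is only an illustrative sketch. It assumes a PPO-style clipped surrogate, a common on-policy objective, and is not taken from the paper. The function name `clipped_surrogate`, the parameter `clip_eps`, and the advantage inputs are hypothetical. Under this assumption, the same function could serve both phases: offline, `logp_old` would come from the estimated behavior policy; online, from the policy that collected the data.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate (to be maximized); an assumed stand-in
    for the on-policy objective mentioned in the abstract."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio between the candidate policy and the reference policy.
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

When the candidate and reference policies agree, every ratio is 1 and the value reduces to the mean advantage. The clipping only takes effect once the candidate policy moves away from the reference, which is what bounds the size of each improvement step.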
