Reinforcement Learning (RL) has achieved impressive results in robotics, yet high-performing pipelines remain highly task-specific, with little reuse of prior data. Offline Model-based RL (MBRL) offers greater data efficiency by training policies entirely from existing datasets, but suffers from compounding errors and distribution shift in long-horizon rollouts. Although existing methods have shown success on controlled simulation benchmarks, applying them robustly to the noisy, biased, and partially observed datasets typical of real-world robotics remains challenging. We present a principled pipeline for making offline MBRL effective on physical robots. Our RWM-U extends autoregressive world models with epistemic uncertainty estimation, enabling temporally consistent multi-step rollouts in which uncertainty is propagated over long horizons. We combine RWM-U with MOPO-PPO, which adapts uncertainty-penalized policy optimization to the stable, on-policy PPO framework for real-world control. We evaluate our approach on diverse manipulation and locomotion tasks in simulation and on real quadruped and humanoid robots, training policies entirely from offline datasets. The resulting policies consistently outperform model-free and uncertainty-unaware model-based baselines, and incorporating real-world data into model learning further yields robust policies that surpass online model-free baselines trained solely in simulation.
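To make the uncertainty-penalized objective concrete, a minimal sketch is given below; the specific penalty form and the use of ensemble disagreement as the epistemic-uncertainty estimate are assumptions here, following the standard MOPO formulation rather than details stated in this abstract:
\[
\tilde{r}(s, a) \;=\; \hat{r}_\theta(s, a) \;-\; \lambda\, u_\theta(s, a),
\qquad
u_\theta(s, a) \;\approx\; \max_{i} \big\| \hat{f}_{\theta_i}(s, a) - \bar{f}_\theta(s, a) \big\|,
\]
where $\hat{r}_\theta$ is the learned reward, $\hat{f}_{\theta_i}$ are the next-state predictions of individual ensemble members, $\bar{f}_\theta$ is their mean, and $\lambda > 0$ trades off predicted return against model uncertainty. Under this sketch, PPO is then run on rollouts of the learned world model with $\tilde{r}$ substituted for $\hat{r}_\theta$, so the policy is discouraged from exploiting regions where the model's epistemic uncertainty is high.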