Reinforcement Learning (RL) has achieved impressive results in robotics, yet high-performing pipelines remain highly task-specific, with little reuse of prior data. Offline model-based RL (MBRL) offers greater data efficiency by training policies entirely from existing datasets, but suffers from compounding errors and distribution shift over long-horizon rollouts. Although existing methods have shown success on controlled simulation benchmarks, applying them robustly to the noisy, biased, and partially observed datasets typical of real-world robotics remains challenging. We present a principled pipeline for making offline MBRL effective on physical robots. Our RWM-U extends autoregressive world models with epistemic uncertainty estimation, enabling temporally consistent multi-step rollouts in which uncertainty is propagated effectively over long horizons. We combine RWM-U with MOPO-PPO, which adapts uncertainty-penalized policy optimization to the stable, on-policy PPO framework for real-world control. We evaluate our approach on diverse manipulation and locomotion tasks in simulation and on real quadruped and humanoid robots, training policies entirely from offline datasets. The resulting policies consistently outperform model-free and uncertainty-unaware model-based baselines, and incorporating real-world data into model learning further yields robust policies that surpass online model-free baselines trained solely in simulation.
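To make the uncertainty-penalized objective concrete, the sketch below illustrates a MOPO-style reward penalty computed from the disagreement of a dynamics-model ensemble, which is one common proxy for epistemic uncertainty. This is a minimal illustration under assumed conventions, not the paper's implementation; the names `penalized_reward`, `ensemble_means`, and `penalty_coef` are hypothetical.

```python
import numpy as np

# Minimal sketch (not the authors' code): MOPO-style uncertainty penalty.
# Epistemic uncertainty is approximated by the disagreement among the
# next-state predictions of an ensemble of learned dynamics models.

def penalized_reward(ensemble_means, reward, penalty_coef=1.0):
    """Return the reward minus a penalty proportional to model uncertainty.

    ensemble_means: array of shape (n_models, state_dim), each row being one
        ensemble member's predicted next state for the same (state, action).
    reward: scalar predicted reward for this transition.
    penalty_coef: weight (lambda) on the uncertainty penalty.
    """
    # Use the largest per-dimension standard deviation across the ensemble
    # as a scalar proxy for epistemic uncertainty.
    disagreement = ensemble_means.std(axis=0).max()
    return reward - penalty_coef * disagreement


# Example: three ensemble members disagree slightly on the next state,
# so the reward used for policy optimization is discounted accordingly.
preds = np.array([[0.10, 0.50], [0.12, 0.48], [0.09, 0.53]])
print(penalized_reward(preds, reward=1.0, penalty_coef=0.5))
```

In an on-policy setting such as PPO, a penalty of this form would be applied to rewards generated during model rollouts before advantage estimation, discouraging the policy from exploiting regions where the world model is unreliable.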