World models offer a promising avenue for more faithfully capturing complex dynamics, including contacts and non-rigidity, as well as complex sensory information, such as visual perception, in situations where standard simulators struggle. However, these models are computationally expensive to evaluate, posing a challenge for popular RL approaches that have been used successfully with simulators to solve complex locomotion tasks yet still struggle with manipulation. This paper introduces a method that bypasses simulators entirely, training RL policies inside world models learned from robots' interactions with real environments. At its core, our approach enables policy training with large-scale diffusion models via a novel decoupled first-order gradient (FoG) method: a full-scale world model generates accurate forward trajectories, while a lightweight latent-space surrogate approximates its local dynamics for efficient gradient computation. This coupling of a local and a global world model ensures high-fidelity unrolling alongside computationally tractable differentiation. We demonstrate the efficacy of our method on the Push-T manipulation task, where it significantly outperforms PPO in sample efficiency. We further evaluate our approach on an ego-centric object manipulation task with a quadruped. Together, these results demonstrate that learning inside data-driven world models is a promising pathway for solving hard-to-model RL tasks in image space without reliance on hand-crafted physics simulators.
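The decoupled first-order gradient idea can be illustrated with a toy sketch: unroll the trajectory with the accurate "full" model, but backpropagate through surrogate-supplied local Jacobians. All names here (`full_model`, `surrogate_grads`, the linear policy, the scalar state) are hypothetical simplifications; in the paper the full model is a large-scale diffusion world model and the surrogate is a learned latent-space approximation.

```python
import numpy as np

def full_model(s, a):
    # Stand-in for the full-scale world model: accurate forward dynamics,
    # treated as non-differentiable (too expensive to backprop through).
    return 0.9 * s + np.tanh(a)

def surrogate_grads(s, a):
    # Stand-in for the lightweight surrogate: cheap approximations of the
    # local Jacobians d s'/d s and d s'/d a (here, the exact derivatives).
    return 0.9, 1.0 - np.tanh(a) ** 2

def policy(theta, s):
    # Toy linear policy: a = theta * s.
    return theta * s

def rollout_and_grad(theta, s0, T=5):
    # Forward pass: unroll with the accurate full model.
    states, actions = [s0], []
    for _ in range(T):
        a = policy(theta, states[-1])
        actions.append(a)
        states.append(full_model(states[-1], a))
    reward = states[-1]  # toy reward: final state value

    # Backward pass: first-order gradient assembled from surrogate Jacobians
    # evaluated along the full-model trajectory (the "decoupled" step).
    g_s, g_theta = 1.0, 0.0  # d reward / d s_T, accumulated d reward / d theta
    for t in reversed(range(T)):
        ds, da = surrogate_grads(states[t], actions[t])
        # a_t = theta * s_t  =>  d a_t/d theta = s_t,  d a_t/d s_t = theta
        g_theta += g_s * da * states[t]
        g_s = g_s * (ds + da * theta)
    return reward, g_theta
```

Because the surrogate here matches the true local derivatives, the resulting gradient agrees with a finite-difference check through the full rollout; in practice the surrogate only approximates the diffusion model's dynamics, trading some gradient accuracy for tractable differentiation.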