Reinforcement Learning (RL) has achieved impressive results on complex tasks but struggles in multi-task settings with different embodiments. World models offer scalability by learning a simulation of the environment, yet they often rely on inefficient gradient-free optimization methods. We introduce Policy learning with large World Models (PWM), a novel model-based RL algorithm that learns continuous control policies from large multi-task world models. By pre-training the world model on offline data and using it for first-order gradient policy learning, PWM effectively solves tasks with up to 152 action dimensions and outperforms methods using ground-truth dynamics. Additionally, PWM scales to an 80-task setting, achieving up to 27% higher rewards than existing baselines without the need for expensive online planning. Visualizations and code are available at https://www.imgeorgiev.com/pwm
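The core idea, first-order gradient policy learning through a differentiable world model, can be illustrated with a minimal sketch. This is not the paper's implementation: the linear dynamics `s' = A*s + B*a`, the quadratic reward, and the scalar policy `a = theta*s` are toy assumptions standing in for a learned world model, policy network, and reward model. The point is that the policy is updated with the analytic gradient of the model-based return, backpropagated through the rollout, rather than with a gradient-free planner.

```python
# Toy sketch (assumed setup, not PWM itself): first-order gradient policy
# learning through a differentiable world model.
A, B = 1.2, 0.5   # stand-in for a learned, differentiable model: s' = A*s + B*a
H = 5             # rollout horizon
s0 = 1.0          # initial state

def rollout_return(theta):
    """J(theta): sum of rewards r_t = -s_t^2 over an H-step model rollout
    under the closed-loop policy a = theta * s."""
    s, J = s0, 0.0
    for _ in range(H):
        s = (A + B * theta) * s   # one step through the world model
        J += -s ** 2
    return J

def first_order_grad(theta):
    """Analytic dJ/dtheta, i.e. the gradient obtained by backpropagating
    the return through the rollout (s_t = (A + B*theta)^t * s0)."""
    m = A + B * theta
    return sum(-2.0 * (m ** t * s0) * t * m ** (t - 1) * B * s0
               for t in range(1, H + 1))

# Gradient ascent on the model-based return; the policy learns to damp
# the closed-loop gain A + B*theta toward zero.
theta = 0.0
for _ in range(1000):
    theta += 1e-2 * first_order_grad(theta)
```

In an actual instantiation, `first_order_grad` would come from automatic differentiation through the world-model network, which is what makes the approach cheap at decision time compared with online planning.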