World models are a powerful tool for developing intelligent agents. By predicting the outcome of a sequence of actions, world models enable policies to be optimised via on-policy reinforcement learning (RL) using synthetic data, i.e. "in imagination". Existing world models are autoregressive in that they interleave predicting the next state with sampling the next action from the policy. As a result, prediction error inevitably compounds as the trajectory length grows. In this work, we propose a novel world modelling approach that is not autoregressive and generates entire on-policy trajectories in a single pass through a diffusion model. Our approach, Policy-Guided Trajectory Diffusion (PolyGRAD), uses a denoising model together with the gradient of the policy's action distribution to diffuse a trajectory of initially random states and actions into an on-policy synthetic trajectory. We analyse the connections between PolyGRAD, score-based generative models, and classifier-guided diffusion models. Our results demonstrate that PolyGRAD outperforms state-of-the-art baselines in terms of trajectory prediction error for moderate-length trajectories, with the exception of autoregressive diffusion. At short horizons, PolyGRAD obtains comparable errors to autoregressive diffusion, but with significantly lower computational requirements. Our experiments also demonstrate that PolyGRAD enables performant policies to be trained via on-policy RL in imagination for MuJoCo continuous control domains. Thus, PolyGRAD introduces a new paradigm for scalable and non-autoregressive on-policy world modelling.
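The idea of refining a whole trajectory at once, rather than rolling it out step by step, can be illustrated with a toy sketch. This is not the PolyGRAD implementation: the one-dimensional dynamics `s[t+1] = s[t] + a[t]`, the Gaussian policy with `GAIN` and `SIGMA`, the linear noise schedule, and the function names are all assumptions made for illustration, and `toy_denoiser` is a hand-written stand-in for what would be a learned denoising model. Because the noise is annealed to zero, the result concentrates near the policy mean rather than sampling the full action distribution.

```python
import numpy as np

# Hypothetical Gaussian policy for illustration: a ~ N(GAIN * s, SIGMA^2).
GAIN, SIGMA = -0.5, 0.3


def policy_score(actions, states):
    """Gradient of log N(a; GAIN * s, SIGMA^2) with respect to a."""
    return -(actions - GAIN * states) / SIGMA**2


def toy_denoiser(s0, actions):
    """Stand-in for a learned denoiser (assumption): returns the clean
    states implied by the actions under s[t+1] = s[t] + a[t]."""
    return s0 + np.concatenate(([0.0], np.cumsum(actions[:-1])))


def generate_trajectory(s0, horizon=10, n_steps=200, step_size=0.01, seed=0):
    rng = np.random.default_rng(seed)
    # Start from a trajectory of purely random states and actions.
    states = rng.standard_normal(horizon)
    actions = rng.standard_normal(horizon)
    for i in range(n_steps):
        # Linear schedule (assumption): noise shrinks to exactly 0 at the end.
        noise_scale = 1.0 - (i + 1) / n_steps
        # Policy guidance: move every action along the policy's score,
        # with annealed Langevin-style noise.
        actions = (
            actions
            + step_size * policy_score(actions, states)
            + np.sqrt(2.0 * step_size) * noise_scale * rng.standard_normal(horizon)
        )
        # Denoising: re-estimate all states jointly given the current actions.
        states = toy_denoiser(s0, actions) + noise_scale * rng.standard_normal(horizon)
    return states, actions
```

Calling `generate_trajectory(s0=2.0)` returns state and action arrays of length `horizon`. Every time step is updated in parallel at each refinement iteration, so no step of the trajectory is generated by conditioning on a previously sampled next state.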