World models are a powerful tool for developing intelligent agents. By predicting the outcome of a sequence of actions, world models enable policies to be optimised via on-policy reinforcement learning (RL) using synthetic data, i.e. "in imagination". Existing world models are autoregressive in that they interleave predicting the next state with sampling the next action from the policy. Prediction error inevitably compounds as the trajectory length grows. In this work, we propose a novel world modelling approach that is not autoregressive and generates entire on-policy trajectories in a single pass through a diffusion model. Our approach, Policy-Guided Trajectory Diffusion (PolyGRAD), leverages a denoising model in addition to the gradient of the action distribution of the policy to diffuse a trajectory of initially random states and actions into an on-policy synthetic trajectory. We analyse the connections between PolyGRAD, score-based generative models, and classifier-guided diffusion models. Our results demonstrate that PolyGRAD outperforms state-of-the-art baselines in terms of trajectory prediction error for short trajectories, with the exception of autoregressive diffusion, for which PolyGRAD obtains similar errors but with lower computational requirements. For long trajectories, PolyGRAD obtains comparable performance to baselines. Our experiments demonstrate that PolyGRAD enables performant policies to be trained via on-policy RL in imagination for MuJoCo continuous control domains. Thus, PolyGRAD introduces a new paradigm for accurate on-policy world modelling without autoregressive sampling.
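The guided diffusion process described above can be sketched schematically as follows. This is a minimal illustrative loop, not the paper's implementation: the denoising network, the policy, the noise schedule, and the guidance step size are all stand-in assumptions. It shows the core idea of jointly denoising a trajectory of random states while nudging the actions toward the policy's action distribution via its score (the gradient of the log-density with respect to the actions).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: T timesteps, state dim S, action dim A.
T, S, A = 16, 3, 2

def denoiser(states, actions, sigma):
    # Stand-in for a learned denoising network that predicts the noise
    # present in the state trajectory at noise level sigma (illustrative:
    # a real model would condition on states, actions, and sigma).
    return 0.1 * states

def policy_score(states, actions):
    # Gradient of log pi(a|s) with respect to the actions, for a
    # hypothetical unit-variance Gaussian policy with mean tanh(s @ W).
    W = np.full((S, A), 0.5)
    mean = np.tanh(states @ W)
    return mean - actions  # grad_a log N(a; mean, I) = mean - a

# Start from a trajectory of pure noise (states and actions).
states = rng.standard_normal((T, S))
actions = rng.standard_normal((T, A))

# Toy decreasing noise schedule.
sigmas = np.linspace(1.0, 0.05, 10)

for sigma in sigmas:
    # Denoise the states using the learned model.
    eps_hat = denoiser(states, actions, sigma)
    states = states - sigma * eps_hat
    # Guide the actions toward the policy's action distribution,
    # analogous to classifier guidance with the policy as the "classifier".
    actions = actions + sigma**2 * policy_score(states, actions)
```

After the loop, the (states, actions) pair approximates an on-policy trajectory in a single (non-autoregressive) generation pass, which is the behaviour the abstract attributes to PolyGRAD.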