World Models via Policy-Guided Trajectory Diffusion

World models are a powerful tool for developing intelligent agents. By predicting the outcome of a sequence of actions, world models enable policies to be optimised via on-policy reinforcement learning (RL) using synthetic data, i.e. "in imagination". Existing world models are autoregressive in that they interleave predicting the next state with sampling the next action from the policy. Prediction error inevitably compounds as the trajectory length grows. In this work, we propose a novel world modelling approach that is not autoregressive and generates entire on-policy trajectories in a single pass through a diffusion model. Our approach, Policy-Guided Trajectory Diffusion (PolyGRAD), leverages a denoising model in addition to the gradient of the action distribution of the policy to diffuse a trajectory of initially random states and actions into an on-policy synthetic trajectory. We analyse the connections between PolyGRAD, score-based generative models, and classifier-guided diffusion models. Our results demonstrate that PolyGRAD outperforms state-of-the-art baselines in terms of trajectory prediction error for moderate-length trajectories, with the exception of autoregressive diffusion. At short horizons, PolyGRAD obtains comparable errors to autoregressive diffusion, but with significantly lower computational requirements. Our experiments also demonstrate that PolyGRAD enables performant policies to be trained via on-policy RL in imagination for MuJoCo continuous control domains. Thus, PolyGRAD introduces a new paradigm for scalable and non-autoregressive on-policy world modelling.
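The idea of refining a whole trajectory at once, with states pulled towards a denoiser's prediction and actions pushed along the gradient of the policy's log-density, can be illustrated with a toy NumPy sketch. This is not the authors' implementation. Everything specific here is an assumption made for illustration: the scalar dynamics `s[t+1] = s[t] + a[t]`, the Gaussian policy with mean `gain * s`, the hand-written `toy_denoiser` standing in for a learned denoising network, and the step sizes and noise schedule.

```python
import numpy as np


def policy_score(states, actions, gain, sigma):
    """Gradient of log N(a; gain * s, sigma^2) with respect to the action a."""
    return -(actions - gain * states) / sigma**2


def toy_denoiser(actions, s0):
    """Hypothetical stand-in for a learned denoiser.

    Returns the states implied by the current actions under the assumed
    toy dynamics s[t+1] = s[t] + a[t], starting from s0.
    """
    clean = np.empty_like(actions)
    clean[0] = s0
    clean[1:] = s0 + np.cumsum(actions[:-1])
    return clean


def guided_trajectory_sample(s0, horizon, n_steps=200, lr=0.005,
                             gain=-0.5, sigma=0.1, seed=0):
    """Jointly refine a random (states, actions) trajectory.

    Every refinement step updates all time steps of the trajectory at once,
    rather than alternating one state prediction with one action sample.
    """
    rng = np.random.default_rng(seed)
    states = rng.normal(size=horizon)   # initially random states
    actions = rng.normal(size=horizon)  # initially random actions
    for i in range(n_steps):
        # Noise is annealed to zero over the first half of the steps; the
        # second half is noise-free refinement.
        noise_scale = max(0.0, 1.0 - 2.0 * i / n_steps)
        # Move the states halfway towards the denoiser's prediction.
        states = states + 0.5 * (toy_denoiser(actions, s0) - states)
        # Langevin-style action update along the policy score.
        actions = (actions
                   + lr * policy_score(states, actions, gain, sigma)
                   + noise_scale * np.sqrt(2.0 * lr) * rng.normal(size=horizon))
    return states, actions
```

Calling `guided_trajectory_sample(s0=1.0, horizon=8)` returns a trajectory in which each action sits at the policy mean for its state and each state follows from the preceding actions under the toy dynamics. In a real system the denoiser would be a trained network operating on noisy states, and the noise schedule would follow the diffusion formulation rather than this linear ramp.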

