World models are a powerful tool for developing intelligent agents. By predicting the outcome of a sequence of actions, world models enable policies to be optimised via on-policy reinforcement learning (RL) using synthetic data, i.e. "in imagination". Existing world models are autoregressive in that they interleave predicting the next state with sampling the next action from the policy. Prediction error inevitably compounds as the trajectory length grows. In this work, we propose a novel world modelling approach that is not autoregressive and generates entire on-policy trajectories in a single pass through a diffusion model. Our approach, Policy-Guided Trajectory Diffusion (PolyGRAD), leverages a denoising model in addition to the gradient of the action distribution of the policy to diffuse a trajectory of initially random states and actions into an on-policy synthetic trajectory. We analyse the connections between PolyGRAD, score-based generative models, and classifier-guided diffusion models. Our results demonstrate that PolyGRAD outperforms state-of-the-art baselines in terms of trajectory prediction error for short trajectories, with the exception of autoregressive diffusion, for which PolyGRAD obtains similar errors but with lower computational requirements. For long trajectories, PolyGRAD obtains comparable performance to baselines. Our experiments demonstrate that PolyGRAD enables performant policies to be trained via on-policy RL in imagination for MuJoCo continuous control domains. Thus, PolyGRAD introduces a new paradigm for accurate on-policy world modelling without autoregressive sampling.
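The guided diffusion process described above can be sketched schematically as follows. This is a minimal illustrative loop, not the paper's implementation: the denoising network, the policy, the noise schedule, and the guidance step size are all stand-in assumptions. It shows the core idea of jointly denoising a trajectory of random states while nudging the actions toward the policy's action distribution via its score (the gradient of the log-density with respect to the actions).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: T timesteps, state dim S, action dim A.
T, S, A = 16, 3, 2

def denoiser(states, actions, sigma):
    # Stand-in for a learned denoising network that predicts the noise
    # present in the state trajectory at noise level sigma (illustrative:
    # a real model would condition on states, actions, and sigma).
    return 0.1 * states

def policy_score(states, actions):
    # Gradient of log pi(a|s) with respect to the actions, for a
    # hypothetical unit-variance Gaussian policy with mean tanh(s @ W).
    W = np.full((S, A), 0.5)
    mean = np.tanh(states @ W)
    return mean - actions  # grad_a log N(a; mean, I) = mean - a

# Start from a trajectory of pure noise (states and actions).
states = rng.standard_normal((T, S))
actions = rng.standard_normal((T, A))

# Toy decreasing noise schedule.
sigmas = np.linspace(1.0, 0.05, 10)

for sigma in sigmas:
    # Denoise the states using the learned model.
    eps_hat = denoiser(states, actions, sigma)
    states = states - sigma * eps_hat
    # Guide the actions toward the policy's action distribution,
    # analogous to classifier guidance with the policy as the "classifier".
    actions = actions + sigma**2 * policy_score(states, actions)
```

After the loop, the (states, actions) pair approximates an on-policy trajectory in a single (non-autoregressive) generation pass, which is the behaviour the abstract attributes to PolyGRAD.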