With the great success of diffusion models (DMs) in generating realistic synthetic vision data, many researchers have investigated their potential in decision-making and control. Most of these works use DMs to sample directly from the trajectory space, in which case a DM can be viewed as a combination of a dynamics model and a policy. In this work, we explore how to isolate the ability of DMs to act as dynamics models in fully offline settings, so that the learning policy can roll out trajectories. Because DMs learn the data distribution of the dataset, their intrinsic policy is the behavior policy induced by the dataset, which causes a mismatch between the behavior policy and the learning policy. We propose Dynamics Diffusion, DyDiff for short, which iteratively injects information from the learning policy into DMs. DyDiff ensures long-horizon rollout accuracy while maintaining policy consistency, and it can be easily deployed on top of model-free algorithms. We provide theoretical analysis showing the advantage of DMs over conventional dynamics models in long-horizon rollout, and we demonstrate the effectiveness of DyDiff in offline reinforcement learning, where a rollout dataset is provided but no online environment is available for interaction. Our code is at https://github.com/FineArtz/DyDiff.
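The iterative injection of policy information described above can be illustrated with a toy sketch. This is not the DyDiff implementation: `toy_policy`, `toy_trajectory_model`, and `iterative_rollout` are hypothetical names, and the action-conditioned diffusion model is replaced by a deterministic linear stand-in so the loop stays runnable. The only assumption it illustrates is the alternation between generating states conditioned on actions and relabeling actions with the learning policy.

```python
import numpy as np


def toy_policy(states):
    """Hypothetical learning policy: linear feedback a = -0.5 * s."""
    return -0.5 * states


def toy_trajectory_model(s0, actions):
    """Stand-in for an action-conditioned trajectory generator.

    Assumption: a real system would sample the whole state sequence from a
    diffusion model conditioned on s0 and the actions. Here we use the
    deterministic toy dynamics s_{t+1} = s_t + a_t instead.
    """
    states = [s0]
    for a in actions[:-1]:
        states.append(states[-1] + a)
    return np.stack(states)  # shape: (horizon, state_dim)


def iterative_rollout(s0, horizon, n_iters):
    """Alternate state generation and policy relabeling of actions."""
    if n_iters < 1:
        raise ValueError("n_iters must be at least 1")
    # Initial guess: repeat the policy's action at s0 for every step.
    actions = np.tile(toy_policy(s0), (horizon, 1))
    for _ in range(n_iters):
        # Generate states conditioned on the current action sequence.
        states = toy_trajectory_model(s0, actions)
        # Inject the learning policy: relabel actions on generated states.
        actions = toy_policy(states)
    return states, actions


if __name__ == "__main__":
    states, actions = iterative_rollout(np.array([1.0]), horizon=5, n_iters=5)
    print(states.ravel())
    print(actions.ravel())
```

In this toy setting each iteration makes one more step of the trajectory consistent with the policy, so after enough iterations the generated states match what the policy itself would produce under the stand-in dynamics. Whether and how fast a learned diffusion model reaches such consistency is a separate question that this sketch does not address.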