Diffusion models have achieved remarkable success in sequential decision-making by leveraging their highly expressive generative capabilities for policy learning. A central problem in learning diffusion policies is aligning the policy output with human intent across diverse tasks. To achieve this, previous methods perform return-conditioned policy generation or Reinforcement Learning (RL)-based policy optimization, but both rely on pre-defined reward functions. In this work, we propose a novel framework, Forward KL regularized Preference optimization for aligning Diffusion policies, which aligns the diffusion policy with preferences directly. We first train a diffusion policy on the offline dataset without considering preferences, and then align the policy with the preference data via direct preference optimization. During the alignment phase, we formulate direct preference learning for the diffusion policy, employing forward KL regularization in the preference objective to avoid generating out-of-distribution actions. We conduct extensive experiments on MetaWorld manipulation and D4RL tasks. The results show that our method exhibits superior alignment with preferences and outperforms previous state-of-the-art algorithms.
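To make the alignment objective concrete, the following is a minimal sketch of the kind of loss the abstract describes, assuming a DPO-style preference term over preferred/dispreferred trajectory segments $(\tau^w, \tau^l)$ drawn from a preference dataset $\mathcal{D}$, a temperature $\beta$, a forward KL weight $\alpha$, and a reference diffusion policy $\pi_{\mathrm{ref}}$ obtained from the offline pre-training phase; these symbols and the exact composition of the two terms are illustrative assumptions, not the paper's definitive formulation:
\[
\mathcal{L}(\theta) = -\,\mathbb{E}_{(\tau^w,\tau^l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(\tau^w)}{\pi_{\mathrm{ref}}(\tau^w)} - \beta \log \frac{\pi_\theta(\tau^l)}{\pi_{\mathrm{ref}}(\tau^l)}\right)\right] + \alpha\, \mathbb{E}_{s\sim\mathcal{D}}\left[ D_{\mathrm{KL}}\big(\pi_{\mathrm{ref}}(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\big)\right],
\]
where $\sigma$ denotes the logistic sigmoid. The forward direction $D_{\mathrm{KL}}(\pi_{\mathrm{ref}} \,\|\, \pi_\theta)$ is mass-covering: it penalizes $\pi_\theta$ for assigning low probability to actions supported by the reference policy, and hence by the dataset, which is how such a regularizer discourages out-of-distribution actions; the reverse direction would instead be mode-seeking.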