Reinforcement learning (RL) post-training has become increasingly important for improving diffusion policies, but existing diffusion policy-gradient methods are often unstable and fail to deliver reliable policy improvement. We trace this to a double-drift phenomenon: optimizing a variational surrogate can let the ELBO drift away from the true log-likelihood, which in turn misaligns the resulting proxy policy gradient with the true policy gradient of the expected return. We propose \textbf{DiPOD}, a diffusion policy optimization framework that keeps the bound tight throughout training by interleaving self-distillation with policy-improving gradient updates. This yields a simple and practical algorithm: each diffusion policy-gradient update is augmented with an on-policy ELBO regularizer. Across diffusion language model post-training and continuous-control diffusion policies, DiPOD substantially stabilizes training and reaches higher rewards than previous methods.
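As a rough illustration of what "augmenting each diffusion policy-gradient update with an on-policy ELBO regularizer" could look like, one plausible form of the per-update objective is sketched below. The symbols are assumptions introduced here for illustration and are not taken from the source: $\hat{A}(x)$ is an advantage estimate, $\lambda \ge 0$ is a regularization weight, $\mathrm{sg}[\cdot]$ denotes stop-gradient through the sampling distribution, and $\mathrm{ELBO}_\theta(x) \le \log \pi_\theta(x)$ is the variational bound used in place of the intractable log-likelihood.

\[
\mathcal{L}(\theta)
= -\,\mathbb{E}_{x \sim \mathrm{sg}[\pi_\theta]}\!\left[\hat{A}(x)\,\mathrm{ELBO}_\theta(x)\right]
\;-\; \lambda\,\mathbb{E}_{x \sim \mathrm{sg}[\pi_\theta]}\!\left[\mathrm{ELBO}_\theta(x)\right]
\]

Under these assumptions, the first term is the proxy policy-gradient surrogate, and the second term is a self-distillation term that raises the ELBO on the policy's own samples, which is intended to keep the gap $\log \pi_\theta(x) - \mathrm{ELBO}_\theta(x)$ small where the policy actually places mass. The exact weighting and form used by DiPOD may differ.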