We present a maximum entropy inverse reinforcement learning (IRL) approach for improving the sample quality of diffusion generative models, especially when the number of generation time steps is small. Similar to how IRL trains a policy from a reward function learned from expert demonstrations, we train (or fine-tune) a diffusion model using the log probability density estimated from training data. Since we employ an energy-based model (EBM) to represent the log density, our approach boils down to the joint training of a diffusion model and an EBM. Our IRL formulation, named Diffusion by Maximum Entropy IRL (DxMI), is a minimax problem that reaches equilibrium when both models converge to the data distribution. Entropy maximization plays a key role in DxMI, facilitating the exploration of the diffusion model and ensuring the convergence of the EBM. We also propose Diffusion by Dynamic Programming (DxDP), a novel reinforcement learning algorithm for diffusion models, as a subroutine in DxMI. DxDP makes the diffusion model update in DxMI efficient by casting the original problem as an optimal control problem in which value functions replace back-propagation through time. Our empirical studies show that diffusion models fine-tuned with DxMI can generate high-quality samples in as few as 4 and 10 steps. Additionally, DxMI enables the training of an EBM without MCMC, stabilizing EBM training dynamics and enhancing anomaly detection performance.
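Read as a maximum entropy IRL game, the minimax problem described above takes the following schematic form; this is our reconstruction from the abstract, with the reward defined as the negative energy $r_\theta(x) = -E_\theta(x)$, and the paper's exact objective and notation may differ:

\[
\max_{\theta}\;\; \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\big[-E_\theta(x)\big] \;-\; \max_{\pi}\,\Big( \mathbb{E}_{x \sim \pi}\!\big[-E_\theta(x)\big] + \mathcal{H}(\pi) \Big)
\]

The outer maximization fits the EBM to data, while the inner entropy-regularized maximization is the diffusion model's update; writing $\max_\theta [\,\cdot\, - \max_\pi (\,\cdot\,)]$ as $\max_\theta \min_\pi$ recovers the minimax structure. At equilibrium, the sampler $\pi$ and the EBM density $\propto \exp(-E_\theta)$ both match $p_{\mathrm{data}}$. The inner problem is a soft-optimal control problem over the $T$ generation steps, which is what allows DxDP to introduce value functions in place of back-propagation through time.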
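To make the joint training concrete, below is a minimal, self-contained sketch on toy 2-D data, assuming PyTorch. The network sizes, step count, noise scale, energy regularizer, and the use of direct back-propagation through the short sampling chain (where the paper's DxDP subroutine would instead use value functions) are illustrative assumptions of ours, not the authors' implementation.

```python
# Minimal sketch of DxMI-style joint training of a few-step sampler and an
# EBM on 2-D toy data. Hedged: architectures, hyperparameters, and the
# Gaussian-noise entropy surrogate are illustrative choices, not the paper's.
import torch
import torch.nn as nn

torch.manual_seed(0)
T, DIM = 4, 2  # few-step generation regime, toy data dimension

def sample_data(n):
    # Toy target: mixture of two Gaussians.
    centers = torch.tensor([[-2.0, 0.0], [2.0, 0.0]])
    return centers[torch.randint(0, 2, (n,))] + 0.3 * torch.randn(n, DIM)

class EBM(nn.Module):
    # Scalar energy E_theta(x); -E_theta plays the role of the (unnormalized)
    # log density, i.e., the learned reward in the IRL view.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(DIM, 128), nn.SiLU(),
                                 nn.Linear(128, 128), nn.SiLU(),
                                 nn.Linear(128, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)

class Sampler(nn.Module):
    # T-step stochastic sampler ("policy"): x_{t+1} = x_t + f(x_t, t) + sigma*eps.
    # With fixed sigma, the per-step Gaussian entropies are constant in the
    # parameters; a learned sigma would make the entropy term of the
    # objective active, which is what drives exploration in DxMI.
    def __init__(self, sigma=0.2):
        super().__init__()
        self.sigma = sigma
        self.net = nn.Sequential(nn.Linear(DIM + 1, 128), nn.SiLU(),
                                 nn.Linear(128, 128), nn.SiLU(),
                                 nn.Linear(128, DIM))
    def forward(self, n):
        x = torch.randn(n, DIM)  # x_0 ~ N(0, I)
        for t in range(T):
            tt = torch.full((n, 1), t / T)
            x = x + self.net(torch.cat([x, tt], 1)) + self.sigma * torch.randn(n, DIM)
        return x

ebm, sampler = EBM(), Sampler()
opt_e = torch.optim.Adam(ebm.parameters(), lr=1e-4)
opt_s = torch.optim.Adam(sampler.parameters(), lr=1e-4)

for step in range(2000):
    # EBM update: lower energy on data, raise it on sampler outputs. The
    # sampler provides the negative samples, so no MCMC chain is needed.
    x_data, x_gen = sample_data(256), sampler(256).detach()
    e_data, e_gen = ebm(x_data), ebm(x_gen)
    loss_e = e_data.mean() - e_gen.mean() \
        + 0.1 * (e_data ** 2 + e_gen ** 2).mean()  # energy-magnitude stabilizer
    opt_e.zero_grad(); loss_e.backward(); opt_e.step()

    # Sampler update: minimize the energy of generated samples. For brevity
    # we back-propagate through the short T-step chain; DxDP avoids this
    # back-propagation in time by introducing value functions per step.
    loss_s = ebm(sampler(256)).mean()
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()
```

Note how the two halves of the loop mirror the minimax objective: the sampler's own outputs serve as the EBM's negative samples (which is why MCMC is unnecessary), and the sampler in turn descends the current energy, so both models are pushed toward the data distribution.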