By formulating the generation of data samples as a Markov denoising process, diffusion models achieve state-of-the-art performance on a range of tasks. Recently, many variants of diffusion models have been proposed to enable controlled sample generation. Most existing methods either feed the controlling information to the noise approximator as an input (i.e., a conditional representation) or introduce a pre-trained classifier at test time to guide the Langevin dynamics towards the conditional goal. However, the former line of methods works only when the controlling information can be formulated as a conditional representation, while the latter requires the pre-trained guidance classifier to be differentiable. In this paper, we propose a novel framework named RGDM (Reward-Guided Diffusion Model) that guides the training phase of diffusion models via reinforcement learning (RL). The proposed training framework bridges the objectives of weighted log-likelihood and maximum entropy RL, which enables calculating policy gradients from samples drawn from a pay-off distribution proportional to exponentially scaled rewards, rather than from the policies themselves. This framework alleviates high gradient variance and enables diffusion models to explore for highly rewarded samples in the reverse process. Experiments on 3D shape and molecule generation tasks show significant improvements over existing conditional diffusion models.
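The idea of drawing training samples from a pay-off distribution proportional to exponentially scaled rewards, and using them in a weighted log-likelihood objective, can be illustrated with a small sketch. This is not the RGDM implementation: it assumes a finite set of candidate samples with known rewards, and the function names (`payoff_weights`, `sample_from_payoff`, `weighted_nll`) and the `temperature` scaling parameter are hypothetical names introduced here for illustration only.

```python
import math
import random


def payoff_weights(rewards, temperature=1.0):
    """Normalized weights proportional to exp(reward / temperature).

    Assumption: a finite candidate set, so the distribution can be
    normalized exactly by summing over all candidates.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(rewards)  # subtract the max for numerical stability
    unnorm = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def sample_from_payoff(candidates, rewards, k, temperature=1.0, seed=0):
    """Draw k candidates with probability proportional to exp(reward / temperature)."""
    rng = random.Random(seed)
    weights = payoff_weights(rewards, temperature)
    return rng.choices(candidates, weights=weights, k=k)


def weighted_nll(log_likelihoods, rewards, temperature=1.0):
    """Reward-weighted negative log-likelihood over the candidate set.

    log_likelihoods[i] stands in for the model's log-likelihood of
    candidate i (hypothetical input; a diffusion model would typically
    supply a variational bound here instead).
    """
    weights = payoff_weights(rewards, temperature)
    return -sum(w * ll for w, ll in zip(weights, log_likelihoods))
```

In this toy setting the samples come from a fixed reward-defined distribution rather than from the model being trained, which is the property the abstract credits with lowering gradient variance. A lower `temperature` concentrates the weights on the highest-reward candidates, while a higher one approaches uniform weighting.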