The probabilistic diffusion model (DM), which generates content by inference through a recursive chain structure, has emerged as a powerful framework for visual generation. After pre-training on massive data, the model must be properly aligned to meet the requirements of downstream applications, so aligning the foundation DM efficiently is a crucial task. Contemporary methods are based on either Reinforcement Learning (RL) or truncated Backpropagation (BP). However, RL suffers from low sample efficiency and truncated BP from biased gradient estimation, resulting in limited improvement or, even worse, complete training failure. To overcome these challenges, we propose the Recursive Likelihood Ratio (RLR) optimizer, a Half-Order (HO) fine-tuning paradigm for DMs. The HO gradient estimator rearranges the computation graph within the recursive diffusive chain, making the RLR's gradient estimator unbiased with lower variance than existing methods. We theoretically analyze the bias, variance, and convergence of our method, and we conduct extensive experiments on image and video generation to validate the superiority of the RLR. Furthermore, we propose a novel prompt technique that pairs naturally with the RLR to achieve a synergistic effect.
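To make the half-order idea concrete, the sketch below illustrates one way such a hybrid estimator could look: the early denoising steps are detached from the graph and contribute a likelihood-ratio (score-function) term via the Gaussian transition density, while the last few steps keep the graph and contribute a pathwise backpropagation term. This is a minimal toy, not the paper's implementation; `model`, `reward_fn`, `sigmas`, and `k_bp` are hypothetical names, and the step rule is a simplified reparameterized Gaussian transition.

```python
import torch

def hybrid_gradient_step(model, reward_fn, x_T, sigmas, k_bp):
    """Toy hybrid LR + BP fine-tuning step over a denoising chain.

    Early steps (t >= k_bp): detached, scored with a likelihood-ratio term.
    Late steps (t < k_bp): kept in the graph for direct backpropagation.
    All names here are illustrative assumptions, not the paper's API.
    """
    x = x_T
    logprob_sum = 0.0  # accumulates log-densities of the detached segment
    T = len(sigmas)
    for t in reversed(range(T)):
        mean = model(x, t)                 # predicted posterior mean at step t
        noise = torch.randn_like(x)
        x_next = mean + sigmas[t] * noise  # reparameterized Gaussian step
        if t >= k_bp:
            # Zeroth-order segment: cut the graph, keep the LR log-density.
            dist = torch.distributions.Normal(mean, sigmas[t])
            logprob_sum = logprob_sum + dist.log_prob(x_next.detach()).sum()
            x = x_next.detach()
        else:
            # First-order segment: keep the graph for pathwise gradients.
            x = x_next
    r = reward_fn(x)
    # Surrogate loss: LR term (detached reward as coefficient) + BP term.
    loss = -(r.detach() * logprob_sum + r)
    loss.backward()
    return r.detach()
```

Under this split, the score-function term covers the truncated prefix that plain truncated BP would simply drop, which is one plausible reading of how an estimator of this family can stay unbiased while backpropagating through only part of the chain.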