Diffusion alignment adapts pretrained diffusion models to sample from reward-tilted distributions along the denoising trajectory. This process naturally admits a Sequential Monte Carlo (SMC) interpretation, in which the denoising model acts as the proposal and reward guidance induces importance weights. Motivated by this view, we introduce Variance Minimisation Policy Optimisation (VMPO), which formulates diffusion alignment as minimising the variance of the log importance weights rather than directly optimising a Kullback-Leibler (KL) based objective. We prove that the variance objective is minimised by the reward-tilted target distribution and that, under on-policy sampling, its gradient coincides with that of standard KL-based alignment. This perspective offers a unifying lens for understanding diffusion alignment: under different choices of potential functions and variance minimisation strategies, VMPO recovers a range of existing methods while also suggesting new design directions beyond the KL objective.
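To make the stated objective concrete, the following is a minimal sketch in assumed notation; the symbols $q_\theta$ (the trainable denoising model over trajectories $x_{0:T}$), $p_{\mathrm{pre}}$ (the pretrained model), $r$ (the reward), and $\beta$ (a temperature) are our illustrative choices and are not fixed by the text above. Writing the reward-tilted target and the importance weights as

$$
\tilde{p}(x_{0:T}) \;\propto\; p_{\mathrm{pre}}(x_{0:T})\,\exp\!\bigl(r(x_0)/\beta\bigr),
\qquad
w_\theta(x_{0:T}) \;=\; \frac{\tilde{p}(x_{0:T})}{q_\theta(x_{0:T})},
$$

a variance-minimisation objective of the kind described above takes the form

$$
\mathcal{L}(\theta) \;=\; \operatorname{Var}_{x_{0:T}\sim q_\theta}\!\bigl[\log w_\theta(x_{0:T})\bigr] \;\ge\; 0,
$$

with equality exactly when $\log w_\theta$ is almost surely constant, i.e. when $q_\theta$ equals the normalised target $\tilde{p}$; this is the sense in which the variance objective is minimised by the reward-tilted distribution. A convenient property is that $\operatorname{Var}[\log w_\theta + c] = \operatorname{Var}[\log w_\theta]$ for any constant $c$, so the objective can be evaluated with the unnormalised target, without access to its normalising constant.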