Bridging the gap between diffusion models and human preferences is crucial for their integration into practical generative workflows. While optimizing downstream reward models has emerged as a promising alignment strategy, excessive optimization against a learned reward model risks compromising ground-truth performance. In this work, we confront the reward overoptimization problem in diffusion model alignment through the lenses of both inductive and primacy biases. We first identify a potential source of overoptimization: current methods diverge from the temporal inductive bias inherent in the multi-step denoising process of diffusion models. We then find, surprisingly, that dormant neurons in our critic model act as a regularizer against overoptimization, while active neurons reflect primacy bias in this setting. Motivated by these observations, we propose Temporal Diffusion Policy Optimization with critic active neuron Reset (TDPO-R), a policy gradient algorithm that exploits the temporal inductive bias of intermediate timesteps, along with a novel reset strategy that targets active neurons to counteract the primacy bias. Empirical results demonstrate the superior efficacy of our algorithms in mitigating reward overoptimization.
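To make the idea of resetting active critic neurons concrete, the following is a minimal sketch and not the authors' implementation. It assumes a common normalized activation score (a neuron's mean absolute activation divided by the layer-wide average) and re-initializes the incoming weights of every neuron whose score exceeds a threshold. The function names, the threshold value, the uniform re-initialization, and the plain-list representation of a layer are all hypothetical choices made for illustration.

```python
import random


def neuron_scores(activations):
    """Normalized activity score per neuron (assumed definition).

    activations: list of samples, each a list with one activation per neuron.
    Score of neuron i = mean |h_i| divided by the layer average of mean |h|.
    """
    n_neurons = len(activations[0])
    n_samples = len(activations)
    mean_abs = [
        sum(abs(sample[i]) for sample in activations) / n_samples
        for i in range(n_neurons)
    ]
    layer_mean = sum(mean_abs) / n_neurons
    if layer_mean == 0:
        return [0.0] * n_neurons
    return [m / layer_mean for m in mean_abs]


def reset_active_neurons(weights, biases, activations, threshold=0.0, rng=None):
    """Re-initialize neurons whose score is above `threshold` (hypothetical rule).

    weights: list of rows, one row of incoming weights per neuron.
    biases:  list with one bias per neuron.
    Modifies weights and biases in place and returns the reset indices.
    Neurons at or below the threshold (the dormant ones) are left untouched.
    """
    rng = rng or random.Random(0)
    scores = neuron_scores(activations)
    reset = [i for i, s in enumerate(scores) if s > threshold]
    for i in reset:
        fan_in = len(weights[i])
        bound = 1.0 / fan_in ** 0.5  # illustrative uniform init range
        weights[i] = [rng.uniform(-bound, bound) for _ in range(fan_in)]
        biases[i] = 0.0
    return reset


# Toy usage: three neurons, two samples; neuron 0 never fires.
acts = [[0.0, 2.0, 1.0], [0.0, -4.0, 1.0]]
w = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
b = [0.5, 0.5, 0.5]
reset_idx = reset_active_neurons(w, b, acts, threshold=1.0)
```

In this toy layer only the neuron with above-average activity is re-initialized, while the inactive neuron keeps its parameters, which mirrors the abstract's claim that dormant neurons are worth preserving. How often such a reset is applied during training, and to which critic layers, is not stated in the abstract and is left open here.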