Diffusion models are typically trained with objectives that target local denoising at individual time steps (or adjacent pairs of steps), and these objectives do not enforce consistency between predictions along the denoising trajectory. This lack of cross-time consistency can degrade performance, especially for few-step samplers. We introduce a temporal difference (TD) objective that penalizes inconsistency in the model's multi-step progress along the denoising path. By reformulating the diffusion process as a Markov reward process and casting denoising as a policy evaluation problem in reinforcement learning, we derive a unified TD approach that applies to both discrete- and continuous-time diffusion formulations. We further propose a principled sample-based reweighting method that stabilizes training. Empirically, we show that TD training can significantly improve sample quality as measured by FID, with larger gains when the number of sampling steps is small, highlighting its practical utility under low compute budgets. We provide ablation studies to justify our design choices, including pairwise loss reweighting, the regularization weight, and the one-step stride. Overall, our TD approach can serve as a general drop-in objective that enforces cross-time consistency and improves generation quality across different diffusion generative models.
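For readers unfamiliar with policy evaluation, the following is a minimal sketch of textbook TD(0) on a deterministic chain of states, where each state stands in for a denoising time step. It is not the objective proposed in this work, which the abstract does not specify; the function names, the reward values, and the chain structure are all illustrative assumptions. The sketch only shows the generic idea that a value estimate at one step is pulled toward the reward plus the estimate at the next step, which is the kind of cross-step consistency a TD objective enforces.

```python
def td0_evaluate(rewards, gamma=1.0, alpha=0.5, sweeps=200):
    """Textbook TD(0) on a deterministic chain T -> T-1 -> ... -> 0.

    rewards[t - 1] is the (hypothetical) reward for the transition t -> t-1.
    State 0 is terminal and its value stays fixed at 0.
    """
    num_steps = len(rewards)
    values = [0.0] * (num_steps + 1)
    for _ in range(sweeps):
        # Walk the chain from the noisiest state down to the terminal state.
        for t in range(num_steps, 0, -1):
            td_target = rewards[t - 1] + gamma * values[t - 1]
            td_error = td_target - values[t]  # inconsistency between steps
            values[t] += alpha * td_error
    return values


def exact_returns(rewards, gamma=1.0):
    """Closed-form discounted returns for the same chain, for comparison."""
    values = [0.0]
    for reward in rewards:
        values.append(reward + gamma * values[-1])
    return values


if __name__ == "__main__":
    example_rewards = [1.0, 0.5, 0.25, 0.125]  # illustrative values only
    print(td0_evaluate(example_rewards, gamma=0.9))
    print(exact_returns(example_rewards, gamma=0.9))
```

On this toy chain the TD(0) estimates converge to the exact discounted returns, because the transitions and rewards are deterministic. In a diffusion setting the states, rewards, and value function would have to be defined by the method itself, which this sketch does not attempt.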