Diffusion models generate samples by reversing a fixed forward diffusion process. Despite their impressive empirical results, these models can be further improved by reducing the variance of the training targets in their denoising score-matching objective. We argue that the source of this variance lies in the handling of intermediate noise-variance scales, where multiple modes in the data affect the direction of the reverse paths. We propose to remedy the problem by incorporating a reference batch, which we use to calculate weighted conditional scores as more stable training targets. We show that the procedure indeed helps in the challenging intermediate regime by reducing (the trace of) the covariance of the training targets. The new stable targets can be seen as trading bias for reduced variance, where the bias vanishes as the reference batch size increases. Empirically, we show that the new objective improves the image quality, stability, and training speed of various popular diffusion models across datasets, with both general ODE and SDE solvers. When used in combination with EDM, our method yields a current state-of-the-art FID of 1.90 with 35 network evaluations on the unconditional CIFAR-10 generation task. The code is available at https://github.com/Newbeeer/stf.
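The idea of replacing a single-sample target with weighted conditional scores over a reference batch can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes a Gaussian perturbation kernel x_t = x_0 + sigma * eps, so that the conditional score for a reference point x_i is (x_i - x_t) / sigma^2, and it weights each reference point by its normalized likelihood of having produced x_t. The function name `stable_target`, the toy data, and the choice of `sigma` are all hypothetical.

```python
import numpy as np

def stable_target(x_t, ref_batch, sigma):
    """Weighted average of conditional scores over a reference batch.

    Assumes the Gaussian kernel p(x_t | x_i) = N(x_i, sigma^2 I);
    this kernel choice is an assumption of the sketch.
    """
    diffs = ref_batch - x_t                          # shape (n, d)
    # Log-likelihood of x_t under each reference point, up to a constant.
    log_w = -np.sum(diffs ** 2, axis=1) / (2.0 * sigma ** 2)
    log_w -= log_w.max()                             # numerical stability
    w = np.exp(log_w)
    w /= w.sum()                                     # self-normalized weights
    cond_scores = diffs / sigma ** 2                 # per-point conditional scores
    return w @ cond_scores                           # shape (d,)

# Toy data standing in for a dataset (hypothetical).
rng = np.random.default_rng(0)
data = rng.normal(size=(64, 2))
sigma = 1.0

x0 = data[0]
x_t = x0 + sigma * rng.normal(size=2)

# Usual denoising score-matching target: uses only the generating sample.
single = (x0 - x_t) / sigma ** 2
# Target averaged over a reference batch that contains x0.
stable = stable_target(x_t, data, sigma)
```

With a reference batch of size one, the weighted target reduces to the usual single-sample target. Larger batches average over every data point that could plausibly have produced x_t, which is the mechanism the abstract credits with reducing target variance at intermediate noise scales.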