Diffusion models generate samples by reversing a fixed forward diffusion process. Although they already deliver impressive empirical results, these models can be further improved by reducing the variance of the training targets in their denoising score-matching objective. We argue that the source of this variance lies in the handling of intermediate noise-variance scales, where multiple modes in the data affect the direction of reverse paths. We propose to remedy the problem by incorporating a reference batch, which we use to calculate weighted conditional scores as more stable training targets. We show that the procedure indeed helps in the challenging intermediate regime by reducing (the trace of) the covariance of the training targets. The new stable targets can be seen as trading bias for reduced variance, where the bias vanishes as the reference batch size increases. Empirically, we show that the new objective improves the image quality, stability, and training speed of various popular diffusion models across datasets with both general ODE and SDE solvers. When used in combination with EDM, our method yields a current SOTA FID of 1.90 with 35 network evaluations on the unconditional CIFAR-10 generation task. The code is available at https://github.com/Newbeeer/stf
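To make the idea of weighted conditional scores over a reference batch more concrete, the following is a minimal NumPy sketch and not the code from the repository linked above. It assumes a variance-exploding Gaussian perturbation kernel N(y, sigma^2 I), for which the conditional score of a clean point y at a noisy point x_t is (y - x_t) / sigma^2, and it assumes the weights are the kernel densities normalized over the reference batch. The function names `stable_target` and `single_sample_target` are hypothetical.

```python
import numpy as np


def single_sample_target(x_t, y, sigma):
    """Usual denoising score-matching target for one clean sample y.

    Assumes a Gaussian kernel N(y, sigma^2 I), so the conditional
    score at x_t is (y - x_t) / sigma^2.
    """
    return (y - x_t) / sigma ** 2


def stable_target(x_t, ref_batch, sigma):
    """Weighted average of conditional scores over a reference batch.

    x_t:       noisy point, shape (d,)
    ref_batch: clean reference samples, shape (n, d); assumed to
               include the sample that x_t was perturbed from
    sigma:     noise scale of the assumed Gaussian kernel
    """
    diffs = ref_batch - x_t                   # (n, d): y_i - x_t
    sq_dist = np.sum(diffs ** 2, axis=1)      # (n,): ||y_i - x_t||^2
    logits = -sq_dist / (2.0 * sigma ** 2)    # log kernel density up to a constant
    logits = logits - logits.max()            # shift for numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()         # self-normalized weights, sum to 1
    cond_scores = diffs / sigma ** 2          # (n, d): per-sample conditional scores
    return weights @ cond_scores              # (d,): weighted target
```

As a small illustration of the multi-mode effect described above: with reference points at -1 and +1, `x_t = 0`, and `sigma = 1`, the two single-sample targets point in opposite directions (-1 and +1), while the weighted target is exactly zero because both points receive equal weight. With a reference batch of size one, the weighted target reduces to the single-sample target.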