Denoising diffusion models have exhibited remarkable capabilities in image generation. However, generating high-quality samples requires a large number of iterations. Knowledge distillation for diffusion models is an effective way to address this limitation by shortening the sampling process, but it degrades generative quality. Based on our analysis with bias-variance decomposition and experimental observations, we attribute the degradation to the spatial fitting error that occurs in the training of both the teacher and the student model. Accordingly, we propose the $\textbf{S}$patial $\textbf{F}$itting-$\textbf{E}$rror $\textbf{R}$eduction $\textbf{D}$istillation model ($\textbf{SFERD}$). SFERD utilizes attention guidance from the teacher model and a designed semantic gradient predictor to reduce the student's fitting error. Empirically, our proposed model facilitates high-quality sample generation within a few function evaluations. With only one step, we achieve an FID of 5.31 on CIFAR-10 and 9.39 on ImageNet 64$\times$64, outperforming existing diffusion methods. Our study provides a new perspective on diffusion distillation by highlighting the intrinsic denoising ability of models. Project link: \url{https://github.com/Sainzerjj/SFERD}.