Denoising diffusion models have exhibited remarkable capabilities in image generation. However, generating high-quality samples requires a large number of iterations. Knowledge distillation for diffusion models is an effective way to address this limitation by shortening the sampling process, but it degrades generative quality. Based on a bias-variance decomposition analysis and experimental observations, we attribute the degradation to the spatial fitting error that arises in the training of both the teacher and the student model. Accordingly, we propose the $\textbf{S}$patial $\textbf{F}$itting-$\textbf{E}$rror $\textbf{R}$eduction $\textbf{D}$istillation model ($\textbf{SFERD}$). SFERD uses attention guidance from the teacher model and a specially designed semantic gradient predictor to reduce the student's fitting error. Empirically, our proposed model generates high-quality samples in only a few function evaluations. With a single step, we achieve an FID of 5.31 on CIFAR-10 and 9.39 on ImageNet 64$\times$64, outperforming existing diffusion methods. Our study provides a new perspective on diffusion distillation by highlighting the intrinsic denoising ability of models.