While diffusion models can learn complex distributions, sampling requires a computationally expensive iterative process. Existing distillation methods enable efficient sampling, but have notable limitations, such as performance degradation with very few sampling steps, reliance on training data access, or mode-seeking optimization that may fail to capture the full distribution. We propose EM Distillation (EMD), a maximum likelihood-based approach that distills a diffusion model to a one-step generator model with minimal loss of perceptual quality. Our approach is derived through the lens of Expectation-Maximization (EM), where the generator parameters are updated using samples from the joint distribution of the diffusion teacher prior and inferred generator latents. We develop a reparametrized sampling scheme and a noise cancellation technique that together stabilize the distillation process. We further reveal an interesting connection between our method and existing methods that minimize mode-seeking KL. EMD outperforms existing one-step generative methods in terms of FID scores on ImageNet-64 and ImageNet-128, and compares favorably with prior work on distilling text-to-image diffusion models.