While diffusion models can learn complex distributions, sampling requires a computationally expensive iterative process. Existing distillation methods enable efficient sampling, but have notable limitations, such as performance degradation with very few sampling steps, reliance on training data access, or mode-seeking optimization that may fail to capture the full distribution. We propose EM Distillation (EMD), a maximum likelihood-based approach that distills a diffusion model to a one-step generator model with minimal loss of perceptual quality. Our approach is derived through the lens of Expectation-Maximization (EM), where the generator parameters are updated using samples from the joint distribution of the diffusion teacher prior and inferred generator latents. We develop a reparametrized sampling scheme and a noise cancellation technique that together stabilize the distillation process. We further reveal an interesting connection between our method and existing methods that minimize the mode-seeking KL divergence. EMD outperforms existing one-step generative methods in terms of FID scores on ImageNet-64 and ImageNet-128, and compares favorably with prior work on distilling text-to-image diffusion models.
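The EM-style alternation described above can be illustrated on a toy 1-D problem. Everything in this sketch is an illustrative assumption, not the paper's actual procedure: the "teacher" is the analytic score of a Gaussian, the E-step is approximated by a few Langevin corrections under that score (a stand-in for sampling from the teacher/latent joint), and the M-step is a closed-form Gaussian fit rather than a gradient update of a neural one-step generator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "teacher": analytic score of N(2, 1). In EMD this role is played by
# a pretrained diffusion model's learned score; the Gaussian is an assumption.
def teacher_score(x):
    return -(x - 2.0)

# One-step "generator" g(z) = mu + sigma * z — a stand-in for the student
# network, initialized away from the teacher distribution.
mu, sigma = 0.0, 1.0

for _ in range(200):
    # E-step (sketch): draw one-step generator samples, then apply a few
    # Langevin corrections under the teacher score so the samples move
    # toward the teacher distribution.
    z = rng.standard_normal(512)
    x = mu + sigma * z
    step = 0.1
    for _ in range(10):
        x = x + step * teacher_score(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)
    # M-step (sketch): maximum-likelihood refit of the Gaussian generator to
    # the corrected samples (closed form here; in practice a gradient step).
    mu, sigma = x.mean(), x.std()

# After alternating E- and M-steps, the generator mean approaches the
# teacher mean of 2.0.
```

Note that the Langevin E-step here is stochastic; the paper's reparametrized sampling and noise cancellation address the variance this injects into the generator update, which the closed-form M-step above sidesteps entirely.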