Improved Distribution Matching Distillation for Fast Image Synthesis

Recent approaches have shown promise in distilling diffusion models into efficient one-step generators. Among them, Distribution Matching Distillation (DMD) produces one-step generators that match their teacher in distribution, without enforcing a one-to-one correspondence with the teacher's sampling trajectories. However, to ensure stable training, DMD requires an additional regression loss computed on a large set of noise-image pairs, which the teacher generates with many steps of a deterministic sampler. This is costly for large-scale text-to-image synthesis and limits the student's quality by tying it too closely to the teacher's original sampling paths. We introduce DMD2, a set of techniques that lift this limitation and improve DMD training. First, we eliminate the regression loss and the need for expensive dataset construction. We show that the resulting instability arises because the fake critic does not estimate the distribution of generated samples accurately, and we propose a two time-scale update rule as a remedy. Second, we integrate a GAN loss into the distillation procedure, discriminating between generated samples and real images. This lets us train the student model on real data, which mitigates the teacher model's imperfect real score estimation and enhances quality. Lastly, we modify the training procedure to enable multi-step sampling. We identify and address the training-inference input mismatch problem in this setting by simulating inference-time generator samples during training. Taken together, our improvements set new benchmarks in one-step image generation, with FID scores of 1.28 on ImageNet-64x64 and 8.35 on zero-shot COCO 2014, surpassing the original teacher despite a 500× reduction in inference cost. Further, we show that our approach can generate megapixel images by distilling SDXL, demonstrating exceptional visual quality among few-step methods.
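
The two time-scale update rule mentioned above can be illustrated with a deliberately tiny toy, sketched here under strong assumptions and not taken from the DMD2 implementation. The "generator" is a 1-D Gaussian with a learnable mean, the "real" distribution is a unit-variance Gaussian with a fixed mean, and the "fake critic" is a running estimate of the generator's mean. For unit-variance Gaussians the difference between the fake and real scores reduces to `critic_mean - target_mean`, which serves as the distribution matching gradient. All names (`train_toy_ttur`, `critic_steps_per_gen_step`) and all hyperparameters, including the ratio of 5 critic updates per generator update, are illustrative assumptions.

```python
import random


def train_toy_ttur(critic_steps_per_gen_step=5, gen_steps=200, seed=0):
    """Toy two time-scale loop: several critic updates per generator update.

    Hypothetical 1-D setting (not the paper's code):
      real data  ~ N(target_mean, 1)
      generator  ~ N(gen_mean, 1), gen_mean is the learnable parameter
      fake critic: critic_mean, an estimate of the generator's current mean
    """
    rng = random.Random(seed)
    target_mean = 2.0   # mean of the "real" distribution (assumed)
    gen_mean = -1.0     # generator parameter, starts far from the target
    critic_mean = 0.0   # fake critic's estimate of the generator mean
    gen_lr, critic_lr, batch_size = 0.05, 0.2, 64

    for _ in range(gen_steps):
        # Fast time scale: let the fake critic catch up with the generator.
        for _ in range(critic_steps_per_gen_step):
            samples = [rng.gauss(gen_mean, 1.0) for _ in range(batch_size)]
            batch_mean = sum(samples) / batch_size
            # Gradient step on 0.5 * (critic_mean - batch_mean)^2.
            critic_mean -= critic_lr * (critic_mean - batch_mean)

        # Slow time scale: one generator step along (fake score - real score).
        # For unit-variance Gaussians this difference is critic_mean - target_mean.
        dm_grad = critic_mean - target_mean
        gen_mean -= gen_lr * dm_grad

    return gen_mean, critic_mean


if __name__ == "__main__":
    gen_mean, critic_mean = train_toy_ttur()
    print(f"generator mean: {gen_mean:.3f}, critic estimate: {critic_mean:.3f}")
```

In this toy, the generator's update is only as good as the critic's estimate of where the generator currently is, which is the intuition behind giving the critic more updates per generator step. The sketch omits everything else the abstract describes: noise levels, the GAN loss on real images, and the multi-step sampling procedure.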
