Diffusion-based text-to-image generation models trained on extensive text-image pairs have shown the capacity to generate photorealistic images consistent with textual descriptions. However, a significant limitation of these models is their slow sample generation, which requires iterative refinement through the same network. In this paper, we enhance Score identity Distillation (SiD) by developing long and short classifier-free guidance (LSG) to efficiently distill pretrained Stable Diffusion models without using real training data. SiD aims to optimize a model-based explicit score matching loss, utilizing a score-identity-based approximation alongside the proposed LSG for practical computation. By training exclusively with fake images synthesized by its one-step generator, SiD equipped with LSG rapidly improves FID and CLIP scores, achieving state-of-the-art FID performance while maintaining a competitive CLIP score. Specifically, its data-free distillation of Stable Diffusion 1.5 achieves a record-low FID of 8.15 on the COCO-2014 validation set, with a CLIP score of 0.304 at an LSG scale of 1.5, and an FID of 9.56 with a CLIP score of 0.313 at an LSG scale of 2. We will make our PyTorch implementation and distilled Stable Diffusion one-step generators available at https://github.com/mingyuanzhou/SiD-LSG