Despite their ability to generate high-resolution and diverse images from text prompts, text-to-image diffusion models often suffer from slow iterative sampling. Model distillation is one of the most effective directions for accelerating these models. However, previous distillation methods fail to retain generation quality while also requiring a large number of training images, either drawn from real data or synthetically generated by the teacher model. In response to this limitation, we present a novel image-free distillation scheme named $\textbf{SwiftBrush}$. We draw inspiration from text-to-3D synthesis, in which a 3D neural radiance field that aligns with the input prompt can be obtained from a 2D text-to-image diffusion prior via a specialized loss, without any 3D ground-truth data. Our approach repurposes that same loss to distill a pretrained multi-step text-to-image model into a student network that can generate high-fidelity images in a single inference step. Despite its simplicity, our model is one of the first one-step text-to-image generators that can produce images of comparable quality to Stable Diffusion without relying on any training image data. Remarkably, SwiftBrush achieves an FID score of $\textbf{16.67}$ and a CLIP score of $\textbf{0.29}$ on the COCO-30K benchmark, results that are competitive with, or even substantially better than, existing state-of-the-art distillation techniques.