Diffusion models have revolutionized text-to-image generation with their exceptional quality and creativity. However, their multi-step sampling process is known to be slow, often requiring tens of inference steps to obtain satisfactory results. Previous attempts to speed up sampling and reduce computational cost through distillation have failed to produce a functional one-step model. In this paper, we explore a recent method called Rectified Flow, which, thus far, has only been applied to small datasets. The core of Rectified Flow lies in its \emph{reflow} procedure, which straightens the trajectories of probability flows, refines the coupling between noises and images, and facilitates the distillation process with student models. We propose a novel text-conditioned pipeline to turn Stable Diffusion (SD) into an ultra-fast one-step model, in which we find reflow plays a critical role in improving the assignment between noise and images. Leveraging our new pipeline, we create, to the best of our knowledge, the first one-step diffusion-based text-to-image generator with SD-level image quality, achieving an FID (Fr\'echet Inception Distance) of $23.3$ on MS COCO 2017-5k, surpassing the previous state-of-the-art technique, progressive distillation, by a significant margin ($37.2$ $\rightarrow$ $23.3$ in FID). By utilizing an expanded network with 1.7B parameters, we further improve the FID to $22.4$. We call our one-step models \emph{InstaFlow}. On MS COCO 2014-30k, InstaFlow yields an FID of $13.1$ in just $0.09$ seconds, the best in the $\leq 0.1$ second regime, outperforming the recent StyleGAN-T ($13.9$ in $0.1$ seconds). Notably, the training of InstaFlow only costs 199 A100 GPU days. Project page:~\url{https://github.com/gnobitab/InstaFlow}.
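To illustrate why straightened trajectories matter for one-step generation, the following toy sketch may help. It is not the InstaFlow pipeline: it assumes a hypothetical one-dimensional linear coupling between noise and data (the constants `A` and `B`, and the helper names `velocity` and `euler_sample`, are invented for illustration) whose flow is perfectly straight. Under that assumption, a single Euler step of the ODE lands on the same sample as many small steps.

```python
import numpy as np

# Hypothetical deterministic coupling x1 = A * x0 + B (an assumption for this toy).
A, B = 2.0, 1.0


def velocity(x_t, t):
    """Velocity of the straight flow x_t = (1 - t) * x0 + t * x1."""
    # Recover the start point x0 of the straight path passing through (x_t, t).
    x0 = (x_t - t * B) / ((1.0 - t) + t * A)
    # Along a straight path the velocity is constant and equals x1 - x0.
    return (A - 1.0) * x0 + B


def euler_sample(x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt)
    return x


rng = np.random.default_rng(0)
noise = rng.standard_normal(5)

one_step = euler_sample(noise, 1)    # a single Euler step
many_step = euler_sample(noise, 25)  # a multi-step sampler on the same flow
```

In this idealized case `one_step` and `many_step` agree up to floating-point error. For a learned, curved flow the two would differ, which is the gap that reflow and the subsequent distillation are described as reducing.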