InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation

Diffusion models have revolutionized text-to-image generation with their exceptional quality and creativity. However, their multi-step sampling process is known to be slow, often requiring tens of inference steps to obtain satisfactory results. Previous attempts to improve their sampling speed and reduce computational costs through distillation have been unsuccessful in achieving a functional one-step model. In this paper, we explore a recent method called Rectified Flow, which, thus far, has only been applied to small datasets. The core of Rectified Flow lies in its \emph{reflow} procedure, which straightens the trajectories of probability flows, refines the coupling between noises and images, and facilitates the distillation process with student models. We propose a novel text-conditioned pipeline to turn Stable Diffusion (SD) into an ultra-fast one-step model, in which we find reflow plays a critical role in improving the assignment between noise and images. Leveraging our new pipeline, we create, to the best of our knowledge, the first one-step diffusion-based text-to-image generator with SD-level image quality, achieving an FID (Fréchet Inception Distance) of $23.3$ on MS COCO 2017-5k, surpassing the previous state-of-the-art technique, progressive distillation, by a significant margin ($37.2$ $\rightarrow$ $23.3$ in FID). By utilizing an expanded network with 1.7B parameters, we further improve the FID to $22.4$. We call our one-step models \emph{InstaFlow}. On MS COCO 2014-30k, InstaFlow yields an FID of $13.1$ in just $0.09$ seconds, the best in the $\leq 0.1$ second regime, outperforming the recent StyleGAN-T ($13.9$ in $0.1$ seconds). Notably, the training of InstaFlow costs only 199 A100 GPU days. Project page:~\url{https://github.com/gnobitab/InstaFlow}.
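
To illustrate why straightening trajectories matters for one-step generation, here is a minimal toy sketch in NumPy. It is not the InstaFlow implementation: the function names (`interpolate`, `flow_matching_loss`, `euler_sample`) and the two hand-written velocity fields are assumptions made for illustration only. The loss follows the usual Rectified Flow formulation, in which a velocity model is regressed onto the direction $x_1 - x_0$ along the line $x_t = (1-t)\,x_0 + t\,x_1$ between a noise sample $x_0$ and an image $x_1$.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to image x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def flow_matching_loss(velocity, x0, x1, t):
    """Mean squared error between the predicted velocity and the direction x1 - x0."""
    xt = interpolate(x0, x1, t)
    return float(np.mean((velocity(xt, t) - (x1 - x0)) ** 2))

def euler_sample(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with n_steps Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, i * dt)
    return x

# Hypothetical velocity fields standing in for a trained network.
straight = lambda x, t: np.full_like(x, 2.0)  # constant velocity: straight paths
curved = lambda x, t: x                       # position-dependent velocity: curved paths

x0 = np.array([-1.0, 0.5, 1.0])

# Straight flow: one Euler step lands where 100 steps do.
one_step_straight = euler_sample(straight, x0, 1)
many_step_straight = euler_sample(straight, x0, 100)

# Curved flow: one Euler step gives a visibly different result from 100 steps.
one_step_curved = euler_sample(curved, x0, 1)
many_step_curved = euler_sample(curved, x0, 100)

print("straight:", one_step_straight, many_step_straight)
print("curved:  ", one_step_curved, many_step_curved)
```

In this toy setting, the straight field gives the same endpoint with one step as with a hundred, while the curved field does not. That is the intuition behind using reflow before distilling to a single step; the real pipeline works with a text-conditioned network and learned couplings rather than fixed fields.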

