SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and has privacy implications, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, unlocks running text-to-image diffusion models on mobile devices in less than $2$ seconds. We achieve this by introducing an efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying the redundancy of the original model, and we reduce the computation of the image decoder via data distillation. Further, we enhance the step distillation by exploring training strategies and introducing regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that our model with $8$ denoising steps achieves better FID and CLIP scores than Stable Diffusion v$1.5$ with $50$ steps. Our work democratizes content creation by bringing powerful text-to-image diffusion models into the hands of users.
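The abstract refers to classifier-free guidance without defining it. As a minimal sketch, the standard guidance rule combines an unconditional and a text-conditioned noise prediction as $\hat{\epsilon} = \epsilon_{\text{uncond}} + w\,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$. The function name, the list-based representation, and the parameter names below are illustrative assumptions; this shows only the generic guidance formula, not the paper's specific regularization loss, which the abstract does not describe.

```python
from typing import List


def classifier_free_guidance(
    eps_uncond: List[float],
    eps_cond: List[float],
    guidance_scale: float,
) -> List[float]:
    """Blend two noise predictions with the standard guidance rule.

    eps_uncond: noise predicted with an empty (null) prompt.
    eps_cond: noise predicted with the text prompt.
    guidance_scale: weight w; 0 returns eps_uncond, 1 returns eps_cond,
        and values above 1 extrapolate past the conditional prediction.
    All names here are hypothetical and chosen for illustration.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("noise predictions must have the same length")
    # eps_hat = eps_uncond + w * (eps_cond - eps_uncond), element by element
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

In a real pipeline the two predictions would be tensors produced by the UNet at each denoising step; plain lists are used here only to keep the sketch dependency-free.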

