SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and has privacy implications, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, unlocks running text-to-image diffusion models on mobile devices in less than $2$ seconds. We achieve this by introducing an efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying the redundancy of the original model and reducing the computation of the image decoder via data distillation. Further, we enhance the step distillation by exploring training strategies and introducing regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that our model with $8$ denoising steps achieves better FID and CLIP scores than Stable Diffusion v$1.5$ with $50$ steps. Our work democratizes content creation by bringing powerful text-to-image diffusion models into the hands of users.
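The abstract mentions classifier-free guidance without spelling out the mechanism. As a rough illustration only, the sketch below shows the standard classifier-free guidance combination of an unconditional and a text-conditioned noise prediction. It is not the paper's distillation loss or regularizer; the function name `cfg_combine`, the toy three-element predictions, and the guidance scale of 7.5 are all assumptions chosen for the example.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance on flat lists of floats.

    Starts from the unconditional noise prediction and moves along the
    direction (conditional - unconditional), scaled by guidance_scale.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions (hypothetical values, not model outputs).
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]

# 7.5 is an illustrative guidance scale, not a value taken from the abstract.
guided = cfg_combine(eps_uncond, eps_cond, 7.5)
print(guided)
```

With a scale of 1.0 the result equals the conditional prediction, and with a scale of 0.0 it equals the unconditional one; larger scales push the prediction further toward the text condition.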

