Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable few-step generation by integrating adversarial guidance with knowledge distillation. For the first time, our model, SnapGen, demonstrates the generation of 1024x1024 px images on a mobile device in around 1.4 seconds. On ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for 256x256 px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our model, with merely 379M parameters, surpasses large-scale models with billions of parameters at a significantly smaller size (e.g., 7x smaller than SDXL, 14x smaller than IF-XL).
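As a rough illustration of how the multi-level distillation and adversarial objectives described above could be combined, the sketch below shows one plausible PyTorch formulation. All names here (FeatureProjector, distillation_loss, the loss weights) are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# A minimal sketch of multi-level cross-architecture knowledge distillation
# for a diffusion denoiser, combining output-level and feature-level terms
# with an adversarial loss for few-step generation. Module and loss names,
# as well as the weighting factors, are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureProjector(nn.Module):
    """Maps student features to the teacher's channel width so intermediate
    activations of two different architectures can be compared directly."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Conv2d(student_dim, teacher_dim, kernel_size=1)

    def forward(self, x):
        return self.proj(x)

def distillation_loss(student_out, teacher_out,
                      student_feats, teacher_feats, projectors,
                      disc_logits_fake,
                      w_out=1.0, w_feat=0.5, w_adv=0.1):
    # Output-level distillation: match the teacher's denoising prediction.
    loss_out = F.mse_loss(student_out, teacher_out.detach())

    # Feature-level distillation: match intermediate activations after
    # projecting the student's features into the teacher's feature space.
    loss_feat = sum(
        F.mse_loss(p(s), t.detach())
        for p, s, t in zip(projectors, student_feats, teacher_feats)
    ) / len(projectors)

    # Adversarial guidance: non-saturating generator loss on the
    # discriminator's logits for the student's few-step samples.
    loss_adv = F.softplus(-disc_logits_fake).mean()

    return w_out * loss_out + w_feat * loss_feat + w_adv * loss_adv
```

The 1x1 convolutional projectors are one common way to reconcile mismatched channel widths when the teacher and student do not share an architecture; the teacher terms are detached so gradients flow only through the student.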