Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and has privacy implications, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, unlocks running text-to-image diffusion models on mobile devices in less than $2$ seconds. We achieve this by introducing an efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying redundancy in the original model, and we reduce the computation of the image decoder via data distillation. Further, we enhance step distillation by exploring training strategies and introducing regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that our model with $8$ denoising steps achieves better FID and CLIP scores than Stable Diffusion v$1.5$ with $50$ steps. Our work democratizes content creation by bringing powerful text-to-image diffusion models into the hands of users.