Text-to-image diffusion models can create stunning images from natural language descriptions, rivaling the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and raises privacy concerns, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, enables text-to-image diffusion models to run on mobile devices in less than $2$ seconds. We achieve this by introducing an efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying the redundancy of the original model, and we reduce the computation of the image decoder via data distillation. Further, we enhance step distillation by exploring training strategies and introducing regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that our model with $8$ denoising steps achieves better FID and CLIP scores than Stable Diffusion v$1.5$ with $50$ steps. Our work democratizes content creation by bringing powerful text-to-image diffusion models into the hands of users.
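For readers unfamiliar with classifier-free guidance, which the abstract names as the source of the regularization, the following minimal sketch shows the standard inference-time guidance rule: the guided noise prediction extrapolates from the unconditional prediction toward the text-conditioned one. It is an illustration only and does not reproduce the distillation objective of this work. The function name `cfg_combine`, the flat-list representation of the predictions, and the example scale values are assumptions made for the example.

```python
from typing import List, Sequence


def cfg_combine(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Combine two noise predictions with the standard classifier-free guidance rule.

    eps_uncond:     prediction made with an empty (null) text prompt
    eps_cond:       prediction made with the actual text prompt
    guidance_scale: hypothetical scale w; w = 1 returns the conditional
                    prediction, w = 0 returns the unconditional one
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    # eps_guided = eps_uncond + w * (eps_cond - eps_uncond), element-wise
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    # Toy values standing in for flattened UNet outputs (illustrative only).
    print(cfg_combine([0.0, 1.0], [1.0, 3.0], 2.0))
```

With a scale above $1$, the guided prediction moves past the conditional one, which typically strengthens prompt adherence at the cost of sample diversity.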