Diffusion models have achieved great success in image synthesis through iterative noise estimation with deep neural networks. However, the slow inference, high memory consumption, and computational intensity of the noise estimation model hinder the efficient adoption of diffusion models. Although post-training quantization (PTQ) is considered a go-to compression method for other tasks, it does not work out of the box on diffusion models. We propose a novel PTQ method tailored to the multi-timestep pipeline and model architecture of diffusion models, which compresses the noise estimation network to accelerate the generation process. We identify two key difficulties of diffusion model quantization: the output distributions of the noise estimation network change across time steps, and the shortcut layers within the network have bimodal activation distributions. We tackle these challenges with timestep-aware calibration and split shortcut quantization. Experimental results show that our method can quantize full-precision unconditional diffusion models to 4-bit while maintaining comparable performance (an FID change of at most 2.34, compared to more than 100 for traditional PTQ) in a training-free manner. Our approach can also be applied to text-guided image generation, where we run Stable Diffusion with 4-bit weights at high generation quality for the first time.
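The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, and it rests on assumptions the abstract does not state: that the bimodal shortcut activations arise from concatenating two branches with very different value ranges, that quantization is plain min-max uniform quantization, and that timestep-aware calibration can be approximated by sampling calibration timesteps at a fixed interval. All function names and the synthetic Gaussian data are hypothetical.

```python
import random


def uniform_quantize(values, num_bits):
    """Min-max uniform fake quantization: snap values to a grid, return floats."""
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]


def split_shortcut_quantize(deep, shortcut, num_bits):
    """Quantize each branch with its own range, then concatenate.

    Assumption: the shortcut layer's input is the concatenation of a
    'deep' branch and a 'shortcut' branch with different ranges.
    """
    return uniform_quantize(deep, num_bits) + uniform_quantize(shortcut, num_bits)


def mean_abs_error(reference, approx):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(reference, approx)) / len(reference)


def sample_calibration_timesteps(total_steps, interval):
    """Pick timesteps at a fixed interval so calibration data spans the
    whole denoising trajectory (one simple reading of 'timestep-aware')."""
    return list(range(0, total_steps, interval))


# Synthetic stand-ins for the two branches (hypothetical data).
rng = random.Random(0)
n = 1000
deep = [rng.gauss(0.0, 0.05) for _ in range(n)]     # narrow range
shortcut = [rng.gauss(0.0, 2.0) for _ in range(n)]  # wide range
concat = deep + shortcut

# Joint quantization: one range for the concatenated tensor.
joint = uniform_quantize(concat, num_bits=4)
# Split quantization: one range per branch.
split = split_shortcut_quantize(deep, shortcut, num_bits=4)

# Compare the error on the narrow-range branch only.
joint_err = mean_abs_error(deep, joint[:n])
split_err = mean_abs_error(deep, split[:n])

timesteps = sample_calibration_timesteps(total_steps=1000, interval=50)
```

In this toy setting the narrow-range branch is represented far more accurately when it gets its own quantization range, because a single 4-bit grid stretched over the wide branch leaves only one or two levels for the narrow one. The fixed-interval timestep sampler is only one possible calibration scheme; the abstract does not specify how timesteps are chosen.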