Diffusion models have achieved great success in synthesizing diverse, high-fidelity images. However, slow sampling and high memory cost remain major barriers to their practical adoption, because generation requires iterative noise estimation with a compute-intensive neural network. We propose to address this problem by compressing the noise estimation network through post-training quantization (PTQ), which accelerates generation. Existing PTQ approaches cannot effectively handle the output distributions of the noise estimation network, which change across time steps. We therefore formulate a PTQ method designed specifically for the multi-timestep structure of diffusion models, with a calibration scheme that uses data sampled from different time steps. Experimental results show that our method can directly quantize full-precision diffusion models into 8-bit or 4-bit models in a training-free manner while maintaining comparable performance, with an FID change of at most 1.88. Our approach also applies to text-guided image generation, and for the first time we run Stable Diffusion with 4-bit weights without losing much perceptual quality, as shown in Figure 5 and Figure 9.
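To make the calibration idea concrete, the following is a minimal sketch and not the method evaluated in this work. It assumes a hypothetical `toy_activations` function standing in for the noise estimation network, with an activation spread that grows with the time step, and it swaps in plain min-max uniform quantization as the quantizer. All function names, the 1000-step schedule, and the choice of 10 calibration time steps are illustrative assumptions. The sketch compares a clipping range calibrated on a single time step with one calibrated on data pooled from time steps spread over the whole trajectory.

```python
import random


def toy_activations(t, n=256, seed=0):
    """Hypothetical stand-in for network activations at time step t.

    The spread grows with t to mimic a distribution that shifts across steps.
    """
    rng = random.Random(seed * 10_000 + t)
    scale = 0.5 + 2.0 * t / 1000
    return [rng.gauss(0.0, scale) for _ in range(n)]


def pick_timesteps(total_steps, num_picks):
    """Return uniformly spaced time steps covering the whole trajectory."""
    stride = total_steps // num_picks
    return list(range(0, total_steps, stride))[:num_picks]


def calibrate_range(samples):
    """Min-max calibration: the clipping range is the observed value range."""
    return min(samples), max(samples)


def quantize(x, lo, hi, bits):
    """Uniform quantization of x to 2**bits levels on [lo, hi], then dequantize."""
    step = (hi - lo) / (2 ** bits - 1)
    if step == 0:
        return lo
    clipped = min(max(x, lo), hi)
    return lo + round((clipped - lo) / step) * step


def mean_sq_error(values, lo, hi, bits):
    """Mean squared error between values and their quantized versions."""
    return sum((v - quantize(v, lo, hi, bits)) ** 2 for v in values) / len(values)


# Calibration set pooled from several time steps vs. a single early time step.
steps = pick_timesteps(1000, 10)
multi_step_data = [a for t in steps for a in toy_activations(t)]
single_step_data = toy_activations(0)

lo_multi, hi_multi = calibrate_range(multi_step_data)
lo_single, hi_single = calibrate_range(single_step_data)

# Evaluate both ranges on held-out activations from a late time step.
late = toy_activations(900, seed=1)
err_multi = mean_sq_error(late, lo_multi, hi_multi, bits=8)
err_single = mean_sq_error(late, lo_single, hi_single, bits=8)
print(f"multi-step calibration MSE:  {err_multi:.5f}")
print(f"single-step calibration MSE: {err_single:.5f}")
```

In this toy setting, the range calibrated on one early time step clips most late-step activations, while the pooled range covers them at the cost of a coarser step size. This only illustrates why calibration data should span multiple time steps under the stated assumptions; it says nothing about how a real network behaves.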