Diffusion models have recently achieved great success in synthesizing diverse, high-fidelity images. However, slow sampling and high memory use remain major barriers to their practical adoption, because generation requires iterative noise estimation with a complex neural network. We address this problem by compressing the noise estimation network with post-training quantization (PTQ) to accelerate generation. Existing PTQ approaches cannot effectively handle the output distributions of the noise estimation network, which change across time steps. We therefore formulate a PTQ method designed specifically for the multi-timestep structure of diffusion models, with a calibration scheme that uses data sampled from different time steps. Experiments show that the proposed method can directly quantize full-precision diffusion models into 8-bit or 4-bit models in a training-free manner while maintaining comparable performance, with an FID change of at most 1.88. Our approach also applies to text-guided image generation, and for the first time we can run Stable Diffusion with 4-bit weights without losing much perceptual quality, as shown in Figure 5 and Figure 9.
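To make the calibration idea concrete, here is a minimal toy sketch. It is not the paper's implementation. It assumes a hypothetical scalar "activation" whose spread grows with the time step (`toy_activation`), a plain min/max range calibrator (`calibrate`), and uniform affine quantization (`fake_quantize`); all of these names and the noise model are illustrative assumptions. The sketch compares a clipping range calibrated on a single time step with one calibrated on samples drawn across time steps.

```python
import random


def fake_quantize(x, lo, hi, bits=8):
    """Uniform affine quantize-dequantize of a scalar x onto the range [lo, hi]."""
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    clipped = min(max(x, lo), hi)  # values outside the calibrated range are clipped
    return lo + round((clipped - lo) / scale) * scale


def toy_activation(t, rng, num_steps=1000):
    """Hypothetical activation whose spread grows with time step t (toy assumption)."""
    sigma = 0.1 + 0.9 * t / (num_steps - 1)
    return rng.gauss(0.0, sigma)


def calibrate(timesteps, samples_per_step, rng):
    """Choose the clipping range [lo, hi] as the min/max over calibration samples."""
    data = [toy_activation(t, rng) for t in timesteps for _ in range(samples_per_step)]
    return min(data), max(data)


def quantization_mse(lo, hi, timesteps, samples_per_step, rng, bits=8):
    """Mean squared quantization error over activations drawn at the given time steps."""
    total, count = 0.0, 0
    for t in timesteps:
        for _ in range(samples_per_step):
            x = toy_activation(t, rng)
            total += (x - fake_quantize(x, lo, hi, bits)) ** 2
            count += 1
    return total / count


rng = random.Random(0)
eval_steps = range(0, 1000, 10)                  # time steps used for evaluation

single = calibrate([0], 256, rng)                # calibration data from one time step
multi = calibrate(range(0, 1000, 50), 16, rng)   # calibration data across time steps

mse_single = quantization_mse(*single, eval_steps, 32, rng)
mse_multi = quantization_mse(*multi, eval_steps, 32, rng)
print(f"single-timestep calibration MSE: {mse_single:.5f}")
print(f"multi-timestep calibration MSE:  {mse_multi:.5f}")
```

In this toy setting, the range calibrated on one time step is too narrow for activations at other time steps, so most of the error comes from clipping. Sampling calibration data across time steps widens the range to cover the whole trajectory. A real noise estimation network would need per-layer calibration on actual intermediate inputs, which this sketch does not attempt.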