Recent advancements in diffusion models, particularly the architectural shift from UNet-based diffusion models to Diffusion Transformers (DiTs), have significantly improved the quality and scalability of image synthesis. Despite their impressive generative quality, the large computational requirements of these large-scale models significantly hinder their deployment in real-world scenarios. Post-training Quantization (PTQ) offers a promising solution by compressing model sizes and speeding up inference for pretrained models without requiring retraining. However, we observe that existing PTQ frameworks, designed for ViTs and conventional diffusion models, fall into biased quantization and suffer remarkable performance degradation. In this paper, we find that DiTs typically exhibit considerable variance in both weights and activations, which easily exceeds the limited range of low-bit numerical representations. To address this issue, we devise Q-DiT, which seamlessly integrates three techniques: fine-grained quantization to manage the substantial variance across input channels of weights and activations, an automatic search strategy to optimize the quantization granularity and mitigate redundancies, and dynamic activation quantization to capture activation changes across timesteps. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of the proposed Q-DiT. Specifically, when quantizing DiT-XL/2 to W8A8 on ImageNet 256x256, Q-DiT achieves a remarkable FID reduction of 1.26 compared to the baseline. Under a W4A8 setting, it maintains high fidelity in image generation, showing only a marginal increase in FID and setting a new benchmark for efficient, high-quality quantization of diffusion transformers. Code is available at \href{https://github.com/Juanerx/Q-DiT}{https://github.com/Juanerx/Q-DiT}.
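To make the fine-grained (group-wise) quantization idea concrete, the sketch below shows symmetric per-group quantization along the input-channel dimension: each small group of channels gets its own scale, so an outlier channel cannot inflate the quantization step of its neighbors. This is a minimal illustration of the general technique, not the Q-DiT implementation; the function name, group size, and bit width are hypothetical.

```python
# Minimal sketch of group-wise (fine-grained) symmetric quantization.
# Assumption: a flat list of weights for one output channel, grouped
# along input channels; names and defaults here are illustrative only.

def quantize_groupwise(weights, group_size=4, bits=4):
    """Quantize and dequantize `weights` with one scale per group of
    `group_size` consecutive input channels. Because each group has its
    own scale, an outlier in one group does not coarsen the others."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for signed 4-bit
    dequantized = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        # Per-group scale from the group's own max magnitude.
        scale = max(abs(w) for w in group) / qmax or 1.0
        # Round to integers, clamp to the signed range, then map back.
        q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in group]
        dequantized.extend(wq * scale for wq in q)
    return dequantized

# The outlier 8.0 in the first group leaves the second group's small
# values finely resolved, unlike a single per-tensor scale.
w = [0.1, -0.2, 8.0, 0.05, 0.1, -0.1, 0.2, 0.15]
print(quantize_groupwise(w, group_size=4, bits=4))
```

With a single per-tensor scale, the outlier 8.0 would force a step size of about 1.14 everywhere, flattening all sub-0.2 weights to zero; the per-group scales preserve them, which is the motivation the abstract gives for fine-grained quantization.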