Recent advancements in diffusion models, particularly the architectural shift from UNet-based diffusion models to Diffusion Transformers (DiTs), have significantly improved the quality and scalability of image synthesis. Despite their impressive generative quality, the large computational requirements of these large-scale models significantly hinder their deployment in real-world scenarios. Post-training Quantization (PTQ) offers a promising solution by compressing model sizes and speeding up inference for pretrained models without requiring retraining. However, we observe that existing PTQ frameworks, designed for ViTs and conventional diffusion models, fall into biased quantization and suffer remarkable performance degradation. In this paper, we find that DiTs typically exhibit considerable variance in both weights and activations, which easily exceeds the limited range of low-bit numerical representations. To address this issue, we devise Q-DiT, which seamlessly integrates three techniques: fine-grained quantization to manage the substantial variance across input channels of weights and activations, an automatic search strategy to optimize the quantization granularity and mitigate redundancies, and dynamic activation quantization to capture activation changes across timesteps. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of the proposed Q-DiT. Specifically, when quantizing DiT-XL/2 to W8A8 on ImageNet 256x256, Q-DiT achieves a remarkable FID reduction of 1.26 compared to the baseline. Under a W4A8 setting, it maintains high fidelity in image generation, showing only a marginal increase in FID and setting a new benchmark for efficient, high-quality quantization of diffusion transformers. Code is available at \href{https://github.com/Juanerx/Q-DiT}{https://github.com/Juanerx/Q-DiT}.
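To make the fine-grained (group-wise) quantization idea concrete, the sketch below shows symmetric per-group quantization along the input-channel dimension: each small group of channels gets its own scale, so an outlier channel cannot inflate the quantization step of its neighbors. This is a minimal illustration of the general technique, not the Q-DiT implementation; the function name, group size, and bit width are hypothetical.

```python
# Minimal sketch of group-wise (fine-grained) symmetric quantization.
# Assumption: a flat list of weights for one output channel, grouped
# along input channels; names and defaults here are illustrative only.

def quantize_groupwise(weights, group_size=4, bits=4):
    """Quantize and dequantize `weights` with one scale per group of
    `group_size` consecutive input channels. Because each group has its
    own scale, an outlier in one group does not coarsen the others."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for signed 4-bit
    dequantized = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        # Per-group scale from the group's own max magnitude.
        scale = max(abs(w) for w in group) / qmax or 1.0
        # Round to integers, clamp to the signed range, then map back.
        q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in group]
        dequantized.extend(wq * scale for wq in q)
    return dequantized

# The outlier 8.0 in the first group leaves the second group's small
# values finely resolved, unlike a single per-tensor scale.
w = [0.1, -0.2, 8.0, 0.05, 0.1, -0.1, 0.2, 0.15]
print(quantize_groupwise(w, group_size=4, bits=4))
```

With a single per-tensor scale, the outlier 8.0 would force a step size of about 1.14 everywhere, flattening all sub-0.2 weights to zero; the per-group scales preserve them, which is the motivation the abstract gives for fine-grained quantization.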