Diffusion transformers (DiTs) have exhibited remarkable performance in visual generation tasks, such as generating realistic images or videos from textual instructions. However, larger model sizes and the multi-frame processing required for video generation increase computational and memory costs, posing challenges for practical deployment on edge devices. Post-Training Quantization (PTQ) is an effective method for reducing memory costs and computational complexity. When quantizing diffusion transformers, we find that existing diffusion quantization methods designed for U-Nets struggle to preserve generation quality. After analyzing the major challenges in quantizing diffusion transformers, we design an improved quantization scheme, ViDiT-Q (Video and Image Diffusion Transformer Quantization), to address these issues. Furthermore, we identify that highly sensitive layers and timesteps hinder quantization at lower bit-widths. To tackle this, we improve ViDiT-Q with a novel metric-decoupled mixed-precision quantization method (ViDiT-Q-MP). We validate the effectiveness of ViDiT-Q across a variety of text-to-image and text-to-video models. While baseline quantization methods fail at W8A8 and produce unreadable content at W4A8, ViDiT-Q achieves lossless W8A8 quantization. ViDiT-Q-MP achieves W4A8 with negligible visual quality degradation, yielding a 2.5x reduction in memory and a 1.5x latency speedup.
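To make the W8A8 notation concrete (8-bit weights, 8-bit activations), the following is a minimal sketch of a generic symmetric per-tensor INT8 fake-quantization step, a standard PTQ building block. This is an illustrative assumption using NumPy, not the ViDiT-Q algorithm itself.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor fake-quantization to INT8.

    A generic PTQ building block (illustrative only, not ViDiT-Q):
    map floats to the signed 8-bit range via a single scale factor.
    """
    max_abs = np.abs(x).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from INT8 values."""
    return q.astype(np.float32) * scale

# W8A8 means both the weight matrix and the activation tensor of a
# layer are stored/processed at 8 bits; here we quantize a weight.
w = np.random.randn(64, 64).astype(np.float32)
qw, sw = quantize_int8(w)
w_hat = dequantize(qw, sw)
err = np.abs(w - w_hat).max()  # rounding error is bounded by scale / 2
```

W4A8 follows the same idea with a 4-bit integer grid for weights (range [-8, 7]), which shrinks memory further but makes sensitive layers harder to quantize, motivating the mixed-precision variant described above.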