Diffusion transformers (DiTs) have exhibited remarkable performance in visual generation tasks, such as generating realistic images or videos from textual instructions. However, larger model sizes and the multi-frame processing required for video generation increase computational and memory costs, posing challenges for practical deployment on edge devices. Post-Training Quantization (PTQ) is an effective method for reducing memory costs and computational complexity. When quantizing diffusion transformers, we find that existing diffusion quantization methods designed for U-Nets struggle to preserve quality. After analyzing the major challenges of quantizing diffusion transformers, we design an improved quantization scheme, ViDiT-Q (Video and Image Diffusion Transformer Quantization), to address these issues. Furthermore, we identify that highly sensitive layers and timesteps hinder quantization at lower bit-widths. To tackle this, we extend ViDiT-Q with a novel metric-decoupled mixed-precision quantization method (ViDiT-Q-MP). We validate the effectiveness of ViDiT-Q across a variety of text-to-image and video models. While baseline quantization methods fail at W8A8 and produce unreadable content at W4A8, ViDiT-Q achieves lossless W8A8 quantization. ViDiT-Q-MP achieves W4A8 with negligible visual quality degradation, yielding a 2.5x reduction in memory and a 1.5x latency speedup.
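To make the W8A8 notation concrete, the sketch below illustrates generic symmetric per-tensor post-training quantization of a linear layer, with both weights and activations mapped to 8-bit integers and the matmul carried out in integer arithmetic before rescaling. This is a minimal illustration of the general PTQ setting, not the ViDiT-Q scheme itself; all function names here are illustrative assumptions.

```python
import numpy as np

def quantize_sym(x, n_bits):
    # Symmetric per-tensor quantization: scale so that max|x| maps to the
    # largest positive integer representable with n_bits (signed).
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

# Toy linear layer: float32 reference weights and a batch of activations.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
x = rng.standard_normal((8, 64)).astype(np.float32)

# W8A8: 8-bit weights and 8-bit activations.
qW, sW = quantize_sym(W, 8)
qx, sx = quantize_sym(x, 8)

# Integer matmul (accumulate in int32), then rescale back to float.
y_q = (qx.astype(np.int32) @ qW.T.astype(np.int32)).astype(np.float32) * (sx * sW)
y_fp = x @ W.T

# Relative output error introduced by quantization.
rel_err = np.linalg.norm(y_q - y_fp) / np.linalg.norm(y_fp)
```

For well-behaved (roughly Gaussian) tensors, the relative error of such an 8-bit quantized matmul is typically on the order of a percent; at W4A8 the weight grid is 16x coarser, which is why lower bit-widths require the more careful treatment of sensitive layers and timesteps described above.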