EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models

Diffusion models have demonstrated remarkable capabilities in image synthesis and related generative tasks. Nevertheless, their practicality for low-latency real-world applications is constrained by substantial computational cost and latency. Quantization is a dominant way to compress and accelerate diffusion models, with post-training quantization (PTQ) and quantization-aware training (QAT) as the two main approaches, each with its own trade-offs. PTQ is efficient in both time and data usage, but its performance can degrade at low bit-widths. QAT can alleviate this degradation, but it places substantial demands on computational and data resources. To capitalize on the advantages of both while avoiding their respective drawbacks, we introduce a data-free and parameter-efficient fine-tuning framework for low-bit diffusion models, dubbed EfficientDM, to achieve QAT-level performance with PTQ-like efficiency. Specifically, we propose a quantization-aware variant of the low-rank adapter (QALoRA) that can be merged with model weights and jointly quantized to low bit-width. The fine-tuning process distills the denoising capabilities of the full-precision model into its quantized counterpart, eliminating the requirement for training data. We also introduce scale-aware optimization and employ temporal learned step-size quantization to further enhance performance. Extensive experimental results demonstrate that our method significantly outperforms previous PTQ-based diffusion models while maintaining similar time and data efficiency. Specifically, there is only a marginal 0.05 sFID increase when quantizing both weights and activations of LDM-4 to 4-bit on ImageNet 256x256. Compared to QAT-based methods, our EfficientDM also achieves a 16.2x faster quantization speed with comparable generation quality.
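The idea of merging a low-rank adapter into the weights before quantizing them jointly can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `fake_quantize` and `qalora_forward`, the symmetric per-output-channel quantizer, and the max-based step size are all assumptions made for illustration. The scale-aware optimization and temporal learned step-size quantization mentioned in the abstract are not shown.

```python
import numpy as np

def fake_quantize(w, n_bits=4):
    """Symmetric uniform quantize-dequantize, one step size per output row.

    Assumption: a simple max-based step size, chosen only for illustration.
    Returns the dequantized weights, the integer codes, and the step sizes.
    """
    qmax = 2 ** (n_bits - 1) - 1          # 7 for 4-bit
    qmin = -(2 ** (n_bits - 1))           # -8 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    codes = np.clip(np.round(w / scale), qmin, qmax)
    return codes * scale, codes.astype(np.int8), scale

def qalora_forward(x, w0, lora_a, lora_b, n_bits=4):
    """Hypothetical forward pass of a linear layer with a merged adapter.

    The low-rank update B @ A is added to the frozen weight w0 first,
    and the merged matrix is quantized as a whole, so no separate
    full-precision adapter branch remains at inference time.
    """
    merged = w0 + lora_b @ lora_a
    w_hat, _, _ = fake_quantize(merged, n_bits)
    return x @ w_hat.T

# Toy shapes (assumed): 8 output features, 16 input features, rank 2.
rng = np.random.default_rng(0)
out_f, in_f, rank = 8, 16, 2
w0 = rng.normal(size=(out_f, in_f))
lora_a = rng.normal(size=(rank, in_f)) * 0.01
lora_b = np.zeros((out_f, rank))          # common LoRA init: B starts at zero
x = rng.normal(size=(4, in_f))
y = qalora_forward(x, w0, lora_a, lora_b)
```

In a fine-tuning loop, only `lora_a` and `lora_b` would be updated, with gradients passed through the rounding step by a straight-through estimator; that part is omitted here because it requires an autodiff framework.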

