Diffusion models have demonstrated remarkable capabilities in image synthesis and related generative tasks. Nevertheless, their practicality for low-latency real-world applications is limited by their high computational cost and inference latency. Quantization is a dominant approach to compressing and accelerating diffusion models; post-training quantization (PTQ) and quantization-aware training (QAT) are the two main variants, each with its own trade-offs. PTQ is efficient in both time and data usage, but it can degrade performance at low bit-widths. QAT, on the other hand, alleviates this degradation but places substantial demands on computational and data resources. To capitalize on the advantages of both while avoiding their respective drawbacks, we introduce a data-free and parameter-efficient fine-tuning framework for low-bit diffusion models, dubbed EfficientDM, which achieves QAT-level performance with PTQ-like efficiency. Specifically, we propose a quantization-aware variant of the low-rank adapter (QALoRA) that can be merged with the model weights and jointly quantized to a low bit-width. The fine-tuning process distills the denoising capabilities of the full-precision model into its quantized counterpart, eliminating the need for training data. We also introduce scale-aware optimization and employ temporal learned step-size quantization to further enhance performance. Extensive experimental results demonstrate that our method significantly outperforms previous PTQ-based methods for diffusion models while maintaining similar time and data efficiency. In particular, quantizing both the weights and activations of LDM-4 to 4 bits on ImageNet 256x256 increases sFID by only a marginal 0.05. Compared to QAT-based methods, EfficientDM is also 16.2x faster to quantize while delivering comparable generation quality.
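As a rough illustration of the QALoRA idea described above (merge the low-rank update into the weights first, then quantize the merged result), here is a minimal NumPy sketch. It is not the authors' implementation: the symmetric per-output-channel quantizer, the layer sizes, the rank, the initialization, and the names `fake_quantize` and `qalora_forward` are all assumptions made for illustration, and the mean-squared error against the full-precision output only stands in for the data-free distillation objective.

```python
import numpy as np

def fake_quantize(w, n_bits=4):
    """Symmetric uniform fake quantization, one scale per output channel.

    This quantizer is an assumption for illustration; the abstract does not
    specify the exact scheme.
    """
    qmax = 2 ** (n_bits - 1) - 1
    qmin = -(2 ** (n_bits - 1))
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    codes = np.clip(np.round(w / scale), qmin, qmax)
    return codes * scale, codes.astype(np.int8), scale

def qalora_forward(x, w0, lora_b, lora_a, n_bits=4):
    """Merge the low-rank update into the weights, then quantize jointly."""
    merged = w0 + lora_b @ lora_a            # W0 + B A
    w_hat, _, _ = fake_quantize(merged, n_bits)
    return x @ w_hat.T

# Hypothetical toy layer: 16 inputs, 8 outputs, rank-2 adapter.
rng = np.random.default_rng(0)
out_f, in_f, rank = 8, 16, 2
w0 = rng.normal(size=(out_f, in_f))
lora_b = np.zeros((out_f, rank))             # common LoRA init: B = 0
lora_a = rng.normal(size=(rank, in_f)) * 0.01
x = rng.normal(size=(4, in_f))

y_fp = x @ w0.T                              # full-precision "teacher" output
y_q = qalora_forward(x, w0, lora_b, lora_a)  # quantized "student" output
distill_loss = float(np.mean((y_fp - y_q) ** 2))
```

Because the adapter is folded into the weights before quantization, inference in this sketch needs only the single low-bit matrix and no separate full-precision adapter branch. In actual training, the gradient would have to pass through the rounding step (for example with a straight-through estimator), which this forward-only sketch omits.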