Gradient-Based Post-Training Quantization: Challenging the Status Quo

Quantization has become a crucial step for the efficient deployment of deep neural networks, where floating-point operations are converted to simpler fixed-point operations. In its most naive form, it simply consists of a combination of scaling and rounding transformations, leading to either a limited compression rate or a significant accuracy drop. Recently, gradient-based post-training quantization (GPTQ) methods appear to constitute a suitable trade-off between such simple methods and more powerful, yet expensive, Quantization-Aware Training (QAT) approaches, particularly when attempting to quantize LLMs, where scalability of the quantization process is of paramount importance. GPTQ essentially consists of learning the rounding operation using a small calibration set. In this work, we challenge common choices in GPTQ methods. In particular, we show that the process is, to a certain extent, robust to a number of variables (weight selection, feature augmentation, choice of calibration set). More importantly, we derive a number of best practices for designing more efficient and scalable GPTQ methods, regarding the problem formulation (loss, degrees of freedom, use of non-uniform quantization schemes) or the optimization process (choice of variable and optimizer). Lastly, we propose a novel importance-based mixed-precision technique. These guidelines lead to significant performance improvements on all the tested state-of-the-art GPTQ methods and networks (e.g., +6.819 points on ViT for 4-bit quantization), paving the way for the design of scalable, yet effective quantization methods.
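To make the distinction between naive quantization and learned rounding concrete, the following is a minimal NumPy sketch and not the method evaluated in this work. It assumes a single toy neuron with six weights, a hypothetical random calibration set, and a symmetric 4-bit grid. Each weight is scaled, rounded down, and then given a binary "round up" choice. Round-to-nearest fixes that choice from the weight alone, whereas GPTQ-style methods learn it by gradient descent on a calibration loss. Here the learning step is replaced by an exhaustive search over all rounding choices, which is only feasible at toy scale but shows the degree of freedom being optimized.

```python
import itertools
import numpy as np

def quantize(w, scale, up, bits=4):
    """Scale, round down, add a per-weight 0/1 'round up' choice, clip, rescale."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = np.clip(np.floor(w / scale) + up, qmin, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=6)          # toy neuron: 6 weights (assumption)
x = rng.normal(size=(32, 6))    # hypothetical calibration inputs (assumption)
scale = np.abs(w).max() / 7     # symmetric 4-bit scale

def output_error(w_q):
    """Mean squared difference between full-precision and quantized outputs."""
    return float(np.mean((x @ w - x @ w_q) ** 2))

# Naive baseline: round to nearest, i.e. round up iff the fractional part >= 0.5.
frac = w / scale - np.floor(w / scale)
nearest_up = (frac >= 0.5).astype(float)
w_nearest = quantize(w, scale, nearest_up)

# Stand-in for learned rounding: try all 2^6 up/down choices and keep the one
# with the lowest output error on the calibration inputs.
candidates = (np.array(c, dtype=float)
              for c in itertools.product([0, 1], repeat=w.size))
best_up = min(candidates, key=lambda up: output_error(quantize(w, scale, up)))
w_best = quantize(w, scale, best_up)

print("round-to-nearest error:", output_error(w_nearest))
print("searched rounding error:", output_error(w_best))
```

Because round-to-nearest is one of the candidates in the search, the searched rounding can never do worse on the calibration inputs. Gradient-based methods aim for a similar gain at scale, where enumerating all rounding choices is impossible.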
