Quantization has become a crucial step for the efficient deployment of deep neural networks, in which floating-point operations are converted to simpler fixed-point operations. In its most naive form, it consists of a combination of scaling and rounding transformations, which leads to either a limited compression rate or a significant accuracy drop. Recently, gradient-based post-training quantization (GPTQ) methods have appeared to offer a suitable trade-off between such simple methods and more powerful, yet expensive, Quantization-Aware Training (QAT) approaches, particularly when quantizing LLMs, where the scalability of the quantization process is of paramount importance. GPTQ essentially consists of learning the rounding operation using a small calibration set. In this work, we challenge common choices in GPTQ methods. In particular, we show that the process is, to a certain extent, robust to a number of variables (weight selection, feature augmentation, choice of calibration set). More importantly, we derive a number of best practices for designing more efficient and scalable GPTQ methods, regarding both the problem formulation (loss, degrees of freedom, use of non-uniform quantization schemes) and the optimization process (choice of variable and optimizer). Lastly, we propose a novel importance-based mixed-precision technique. These guidelines lead to significant performance improvements on all the tested state-of-the-art GPTQ methods and networks (e.g. +6.819 points on ViT for 4-bit quantization), paving the way for the design of scalable, yet effective quantization methods.
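The contrast between naive scale-and-round quantization and a learned rounding operation can be illustrated with a toy sketch. This is not the method studied in this work: the gradient-based optimization is replaced here by an exhaustive floor/ceil search that is only feasible for a handful of weights, the quantizer is assumed to be symmetric and per-tensor, and all names and values (`quantize_rtn`, `search_rounding`, the weights, the calibration vectors) are hypothetical.

```python
import itertools
import math

def make_scale(weights, num_bits=4):
    """Per-tensor scale and integer range for symmetric signed quantization."""
    qmax = 2 ** (num_bits - 1) - 1            # 7 for 4 bits
    qmin = -qmax - 1                          # -8 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return scale, qmin, qmax

def clip(value, lo, hi):
    return max(lo, min(hi, value))

def quantize_rtn(weights, num_bits=4):
    """Naive scheme: scale, round to nearest, clip to the integer range."""
    scale, qmin, qmax = make_scale(weights, num_bits)
    codes = [clip(round(w / scale), qmin, qmax) for w in weights]
    return codes, scale

def output_error(weights, codes, scale, calib):
    """Mean squared error between full-precision and quantized dot products."""
    total = 0.0
    for x in calib:
        ref = sum(w * xi for w, xi in zip(weights, x))
        out = sum(c * scale * xi for c, xi in zip(codes, x))
        total += (ref - out) ** 2
    return total / len(calib)

def search_rounding(weights, calib, num_bits=4):
    """Choose floor or ceil per weight to minimize the output error on calib.

    Assumption: brute force stands in for gradient-based optimization, so this
    only works for a very small number of weights (2**n candidates).
    """
    scale, qmin, qmax = make_scale(weights, num_bits)
    options = [
        (clip(math.floor(w / scale), qmin, qmax),
         clip(math.ceil(w / scale), qmin, qmax))
        for w in weights
    ]
    best = min(
        itertools.product(*options),
        key=lambda codes: output_error(weights, codes, scale, calib),
    )
    return list(best), scale

# Hypothetical toy layer (one output neuron) and calibration set.
weights = [0.7, -0.3, 0.12, 0.26]
calib = [[1.0, 0.5, -1.0, 2.0], [0.2, -1.0, 0.7, 1.5], [-0.6, 0.3, 1.1, 0.9]]

rtn_codes, scale = quantize_rtn(weights)
best_codes, _ = search_rounding(weights, calib)
rtn_err = output_error(weights, rtn_codes, scale, calib)
best_err = output_error(weights, best_codes, scale, calib)
print("round-to-nearest:", rtn_codes, rtn_err)
print("searched rounding:", best_codes, best_err)
```

Because round-to-nearest is always one of the floor/ceil candidates, the searched rounding can never do worse on the calibration set. Rounding to nearest minimizes the error on each weight individually, whereas choosing the rounding direction against calibration data targets the error on the layer output.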