Quantization is a pivotal technique for large language model (LLM) serving, yet it poses significant challenges, particularly in achieving effective low-bit quantization. The limited numerical range of low-bit formats introduces non-trivial quantization error, leading to intolerable performance degradation. Anchored in the basic objectives of model compression, this paper delves into the layer-wise error distribution of LLMs during post-training quantization. We then introduce ASER, an algorithm consisting of (1) Error Reconstruction: low-rank compensation of the quantization error with LoRA-style matrices constructed via whitening SVD; and (2) Activation Smoothing: outlier extraction that yields smoother activations and thus better error compensation. ASER can quantize typical LLMs to low-bit representations while preserving accuracy, even in the per-channel W4A8 setup. Experimental results show that ASER is competitive with state-of-the-art quantization algorithms and shows particular promise for activation quantization, with minor overhead.
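The error-reconstruction idea can be sketched in a few lines of NumPy: quantize a weight matrix, measure the residual error, and approximate that error with a rank-r factorization computed under an activation-aware (whitened) metric. This is a minimal illustration, not the paper's implementation; the exact whitening form (a Cholesky factor of the calibration activations' Gram matrix) and all shapes and ranks below are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_sym(w, bits=4):
    # Per-channel symmetric quantization: one scale per output row.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

# Toy weight and calibration activations (hypothetical shapes).
W = rng.normal(size=(64, 64))      # (out_features, in_features)
X = rng.normal(size=(256, 64))     # calibration inputs

Wq = quantize_sym(W)
E = W - Wq                         # quantization error to compensate

# Whitening SVD (assumed form): factor the activation Gram matrix
# X^T X = S S^T, run SVD on E @ S so that input directions the data
# actually excites are compensated first, then undo the whitening.
S = np.linalg.cholesky(X.T @ X + 1e-6 * np.eye(64))
U, sigma, Vt = np.linalg.svd(E @ S, full_matrices=False)

r = 8                              # low rank for the LoRA-style factors
A = U[:, :r] * sigma[:r]           # (out_features, r)
B = Vt[:r] @ np.linalg.inv(S)      # (r, in_features), whitening undone

# Output error with and without the low-rank correction A @ B.
err_plain = np.linalg.norm(X @ Wq.T - X @ W.T)
err_comp = np.linalg.norm(X @ (Wq + A @ B).T - X @ W.T)
assert err_comp < err_plain
```

Because the SVD is taken on `E @ S` rather than `E` alone, the rank-r truncation minimizes the output-space error `‖X (E - AB)ᵀ‖` on the calibration set, not just the weight-space error; at inference the correction runs as two thin matmuls alongside the quantized layer, which is the source of the "minor overhead" claim.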