Post-training quantization (PTQ) serves as a potent technique to accelerate the inference of large language models (LLMs). Nonetheless, existing works still require a considerable number of floating-point (FP) operations during inference, including extra quantization and de-quantization steps as well as non-linear operators such as RMSNorm and Softmax. This limitation hinders the deployment of LLMs on edge and cloud devices. In this paper, we identify that the primary obstacle to integer-only quantization for LLMs lies in the large fluctuation of activations across channels and tokens in both linear and non-linear operations. To address this issue, we propose I-LLM, a novel integer-only fully-quantized PTQ framework tailored for LLMs. Specifically, (1) we develop Fully-Smooth Block-Reconstruction (FSBR) to aggressively smooth inter-channel variations of all activations and weights; (2) to alleviate the degradation caused by inter-token variations, we introduce Dynamic Integer-only MatMul (DI-MatMul), which performs fully integer matrix multiplication while dynamically quantizing its inputs and outputs using integer-only operations; and (3) we design DI-ClippedSoftmax, DI-Exp, and DI-Normalization, which use bit shifts to execute non-linear operators efficiently while maintaining accuracy. Experiments show that I-LLM achieves accuracy comparable to the FP baseline and outperforms non-integer quantization methods; for example, I-LLM can operate at W4A4 with negligible loss of accuracy. To our knowledge, we are the first to bridge the gap between integer-only quantization and LLMs. Our code is available at anonymous.4open.science, and we hope it contributes to the advancement of this field.
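The abstract does not spell out the DI-MatMul or DI-Exp algorithms, so the two sketches below are illustrative reconstructions rather than the paper's method. The first shows one way a dynamic integer-only matmul could requantize its int32 accumulator per token using only shifts and comparisons; the helper `int_bitlen`, the power-of-two output scale, and all constants are assumptions for this sketch.

```python
import numpy as np

def int_bitlen(a: np.ndarray) -> np.ndarray:
    """Elementwise bit length of non-negative integers, via shifts/compares only."""
    a = a.astype(np.int64).copy()
    n = np.zeros_like(a)
    while np.any(a > 0):
        n += a > 0
        a >>= 1
    return n

def di_matmul(a_int8: np.ndarray, w_int8: np.ndarray):
    """int8 x int8 matmul with per-token dynamic requantization, integer-only.

    Returns (y_int8, shift): the dequantized output is
    y_int8 * (scale_a * scale_w) * 2**shift, i.e. the per-token output scale
    is carried as a power-of-two exponent instead of an FP number.
    """
    acc = a_int8.astype(np.int32) @ w_int8.astype(np.int32)  # exact int32 accumulation
    amax = np.abs(acc).max(axis=-1, keepdims=True)           # per-token dynamic range
    shift = np.maximum(int_bitlen(amax) - 7, 0)              # shrink range to |v| < 128
    half = (np.int64(1) << shift) >> 1                       # offset for round-to-nearest
    y = (acc + half) >> shift                                # requantize with a shift
    return np.clip(y, -128, 127).astype(np.int8), shift
```

Likewise, a shift-based integer-only exponential and softmax in the spirit of DI-Exp and DI-ClippedSoftmax might look as follows, reusing the known shift decomposition log2(e) ≈ 1 + 1/2 - 1/16 from prior integer-only i-exp designs; the names `di_exp`/`di_softmax`, the fixed-point headroom `F`, and the max-subtraction (rather than the paper's clipping) are hypothetical choices, not I-LLM's exact operators.

```python
def di_exp(x_int: np.ndarray, one: int):
    """exp(x) for x = x_int / one <= 0, using adds and bit shifts only.

    Returns (mantissa, q) with exp(x) ~= (mantissa / one) * 2**(-q).
    """
    x = x_int.astype(np.int64)
    ip = x + (x >> 1) - (x >> 4)   # x * log2(e); log2(e) ~= 1 + 1/2 - 1/16
    q = (-ip) // one               # integer part of the base-2 exponent
    r = ip + q * one               # remainder in (-one, 0]
    return one + (r >> 1), q       # 2**(r/one) ~= 1 + r/(2*one)

def di_softmax(x_int: np.ndarray, one: int, out_bits: int = 8):
    """Integer-only softmax; output probabilities carry a fixed 2**-out_bits scale."""
    x_int = x_int - x_int.max(axis=-1, keepdims=True)  # make all inputs <= 0
    mant, q = di_exp(x_int, one)
    F = 16                                             # fixed-point headroom (assumed)
    aligned = (mant << F) >> np.minimum(q, 62)         # mant * 2**(F - q), clamped
    return (aligned << out_bits) // aligned.sum(-1, keepdims=True)

# e.g. with input scale 1/64 (one = 64):
logits = np.round(np.array([[2.0, 1.0, 0.1]]) * 64).astype(np.int64)
print(di_softmax(logits, one=64) / 2.0**8)  # float division only for display
```

Carrying scales as power-of-two exponents is what lets de-quantization and re-quantization at operator boundaries collapse into shifts; the paper's actual DI operators may differ in rounding, clipping, and scale-handling details.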