Large language models have demonstrated promising capabilities as their parameter counts scale. However, serving them incurs substantial computation and memory-movement costs due to their sheer size. Quantization has been employed to reduce serving cost and latency, yet outliers in activations hinder the development of INT4 weight-activation quantization. Existing approaches either separate outliers and normal values into two matrices or migrate outliers from activations to weights, and consequently suffer from high latency or accuracy degradation. By observing the activations of large language models, we classify outliers into channel-wise outliers and spike outliers. In this work, we propose Rotated Runtime Smooth (RRS), a plug-and-play activation smoother for quantization, consisting of Runtime Smooth and a rotation operation. Runtime Smooth (RS) eliminates channel-wise outliers by smoothing activations with their channel-wise maximums at runtime. The rotation operation narrows the gap between spike outliers and normal values, alleviating the "victim" effect caused by channel-wise smoothing. The proposed method outperforms the state-of-the-art method on the LLaMA and Qwen families, improving WikiText-2 perplexity from 57.33 to 6.66 under INT4 inference.
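The two operations described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the toy activation values, and the use of a Sylvester-constructed Hadamard matrix for the rotation are illustrative assumptions; the sketch only shows how dividing by runtime channel-wise maximums bounds channel-wise outliers, and how an orthogonal rotation spreads a remaining spike across channels.

```python
import numpy as np

def runtime_smooth(x, eps=1e-5):
    # Runtime Smooth (sketch): divide each channel by its absolute maximum,
    # computed at runtime over the current activation batch. The returned
    # scales s could be folded into the following weight matrix.
    s = np.abs(x).max(axis=0).clip(min=eps)
    return x / s, s

def hadamard(n):
    # Sylvester construction of a normalized Hadamard matrix
    # (assumption: n is a power of two). H is orthogonal: H @ H.T == I.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

# Toy activation: channel 1 is a channel-wise outlier (large in every token).
x = np.array([[1.0, 100.0, 2.0, 1.0],
              [0.5,  80.0, 1.0, 2.0]])

x_s, s = runtime_smooth(x)   # every channel now lies in [-1, 1]
H = hadamard(4)
x_rot = x_s @ H              # rotation mixes channels, flattening spikes
```

Because the rotation is orthogonal, it preserves the GEMM result when its inverse is absorbed into the weights (`(X H)(Hᵀ W) = X W`), so it can be applied before quantization at no accuracy cost in exact arithmetic.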