Rotating the activation and weight matrices to reduce the influence of outliers in large language models (LLMs) has recently attracted significant attention, particularly in the context of model quantization. Prior studies have shown that in low-precision quantization scenarios, such as 4-bit weights and 4-bit activations (W4A4), randomized Hadamard transforms achieve significantly higher accuracy than randomized orthogonal transforms. Notably, the reason behind this phenomenon has remained unknown. In this paper, we find that both transformations are similarly effective at eliminating outliers for common tokens and yield comparable quantization error on them. The accuracy difference instead stems from tokens with massive activations: randomized Hadamard transforms slightly reduce their quantization error, whereas randomized orthogonal transforms increase it. Because these tokens are extremely rare yet critical to model accuracy, we treat this as a long-tail optimization problem and construct a simple yet effective remedy: a weighted loss function. Additionally, we propose an optimization strategy for the rotation matrix that alternates between optimizing the quantization parameters and refining the rotation matrix via orthogonal Procrustes transforms. This makes the distribution of the rotated activation values more conducive to quantization, especially for tokens with massive activations. Our method makes rotated LLMs dual-free, namely Outlier-Free and Massive Activation-Free, and is hence dubbed DFRot. Extensive experiments demonstrate the effectiveness and efficiency of DFRot. By tuning the rotation matrix with just a single sample, DFRot achieves a perplexity improvement of 0.25 and 0.21 on W4A4KV4 and W4A4KV16, respectively, for LLaMA3-8B, a model known for its quantization challenges.
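The two building blocks named above can be sketched compactly. Below is a minimal, hedged illustration, not the authors' exact algorithm: a randomized Hadamard rotation (an orthonormal Hadamard matrix composed with random sign flips) and the closed-form orthogonal Procrustes solution that finds the orthogonal matrix best mapping one matrix onto a target. The function names and the toy rounding-based target are assumptions for illustration only.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an n x n Hadamard matrix (n must be a power of 2).
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def randomized_hadamard(n, seed=0):
    # Randomized Hadamard transform: H * diag(s), with random signs s.
    rng = np.random.default_rng(seed)
    H = hadamard(n) / np.sqrt(n)          # orthonormal Hadamard matrix
    s = rng.choice([-1.0, 1.0], size=n)   # random sign flips per column
    return H * s                          # equivalent to H @ np.diag(s)

def procrustes_refine(X, Y):
    # Orthogonal Procrustes: argmin_R ||X R - Y||_F over orthogonal R,
    # solved in closed form via the SVD of X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy usage: rotate activations, then refine the rotation toward a
# quantization-friendly target (here, naive rounding stands in for a
# real quantize step).
n = 8
R = randomized_hadamard(n)
assert np.allclose(R @ R.T, np.eye(n))    # the rotation is orthogonal
X = np.random.default_rng(1).normal(size=(16, n))
Y = np.round(X @ R)                       # hypothetical quantized target
R_new = procrustes_refine(X, Y)
assert np.allclose(R_new @ R_new.T, np.eye(n))
```

In an alternating scheme of the kind the abstract describes, one would iterate: fix the rotation and fit quantization parameters, then fix the quantized target and update the rotation with the Procrustes step above.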