In this paper, we introduce a new post-training compression paradigm for Large Language Models (LLMs) to facilitate their wider adoption. We delve into low-rank factorization of LLM weights, and find that the challenges of this task stem from the outlier phenomenon in LLM activations and the differing sensitivity of various kinds of layers. To address these issues, we propose a training-free approach called Activation-aware Singular Value Decomposition (ASVD). Specifically, ASVD manages activation outliers by scaling the weight matrix based on the activation distribution, thereby improving decomposition accuracy. Additionally, we propose an efficient iterative calibration process that optimizes layer-specific decomposition by accounting for the varying sensitivity of different LLM layers. ASVD can compress a network by 10-20% without compromising the performance of LLMs. Building on the success of low-rank decomposition of the projection matrices in the self-attention module, we further apply ASVD to compress the KV cache. By reducing the channel dimension of KV activations, the memory requirements of the KV cache can be substantially reduced. Thanks to the 50-75% reduction in the rank of the KV projection matrices, ASVD can further achieve a 50% KV cache reduction without a performance drop, in a training-free manner.
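The core scaling idea can be sketched in a few lines. This is a minimal illustration, not the paper's exact algorithm: it assumes a per-input-channel scale derived from mean absolute activation magnitudes (the paper's precise scaling rule and calibration procedure may differ), decomposes the scaled weight via truncated SVD, and folds the inverse scale back into the second factor so that the product still approximates the original weight:

```python
import numpy as np

def asvd_decompose(W, X, rank):
    """Activation-aware SVD sketch (assumed formulation).

    W: weight matrix of shape (out_dim, in_dim), applied as W @ x.
    X: calibration activations of shape (num_tokens, in_dim).
    Returns factors A (out_dim, rank) and B (rank, in_dim) with W ~= A @ B.
    """
    # Per-input-channel scale from calibration activations; channels with
    # large (outlier) activations get larger scales, so the SVD spends its
    # rank budget on them. The mean-abs rule is an assumption for illustration.
    s = np.abs(X).mean(axis=0) + 1e-6          # shape (in_dim,)

    # Decompose the scaled weight W @ diag(s) with a truncated SVD.
    WS = W * s                                  # scales column j of W by s[j]
    U, sigma, Vt = np.linalg.svd(WS, full_matrices=False)

    A = U[:, :rank] * sigma[:rank]              # (out_dim, rank)
    B = Vt[:rank] / s                           # (rank, in_dim); folds diag(s)^-1 back
    return A, B
```

At inference, `W @ x` is replaced by `A @ (B @ x)`, which is cheaper in parameters and compute whenever `rank * (out_dim + in_dim) < out_dim * in_dim`.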