Pruning is a highly effective approach for compressing large language models (LLMs) and significantly reduces inference latency. However, conventional training-free structured pruning methods often rely on a heuristic metric that removes attention heads indiscriminately across all layers being pruned, without considering where those heads sit in the network architecture. In this work, we propose a novel pruning algorithm that strategically prunes attention heads in the model's higher layers. Because removing attention heads can alter the magnitude of token representations, we introduce an adaptive rescaling parameter that calibrates the representation scale after pruning to counteract this effect. We conduct comprehensive experiments on a wide range of LLMs, including LLaMA3.1-8B, Mistral-7B-v0.3, Qwen2-7B, and Gemma2-9B. Our evaluation covers both generation and discriminative tasks across 27 datasets. The results consistently show that our method outperforms existing structured pruning methods, with the most notable improvement on generation tasks. Code is available at https://github.com/SongtaoLiu0823/HARP.
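The two ideas above, pruning heads only in the higher layers and rescaling the output afterwards, can be illustrated with a minimal sketch. It is not the released HARP implementation. The function names, the `fraction` rule for selecting layers, and the choice of rescaling factor (the ratio of mean token norms before and after pruning) are all assumptions made for illustration.

```python
import math


def higher_layer_indices(num_layers, fraction):
    """Indices of the top `fraction` of layers (hypothetical selection rule)."""
    n = int(round(num_layers * fraction))
    return list(range(num_layers - n, num_layers))


def _sum_heads(head_outputs, heads):
    """Sum per-head contributions (each: tokens x dims) over the given heads."""
    num_tokens = len(head_outputs[0])
    dim = len(head_outputs[0][0])
    total = [[0.0] * dim for _ in range(num_tokens)]
    for h in heads:
        for t in range(num_tokens):
            for d in range(dim):
                total[t][d] += head_outputs[h][t][d]
    return total


def _mean_norm(tokens):
    """Mean Euclidean norm over token vectors."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in tokens) / len(tokens)


def prune_and_rescale(head_outputs, kept_heads, eps=1e-8):
    """Drop all heads not in `kept_heads`, then rescale the remaining output.

    Assumption: the rescaling factor `alpha` is the ratio of the mean token
    norm with all heads to the mean token norm with only the kept heads.
    This is an illustrative stand-in, not the paper's formula.
    """
    full = _sum_heads(head_outputs, range(len(head_outputs)))
    pruned = _sum_heads(head_outputs, kept_heads)
    alpha = _mean_norm(full) / (_mean_norm(pruned) + eps)
    rescaled = [[alpha * x for x in row] for row in pruned]
    return rescaled, alpha
```

In this sketch, `head_outputs[h]` is the contribution of head `h` to the attention output for each token, so removing a head shrinks the summed representation and `alpha` restores its average magnitude.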