Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks, yet their significant computational and memory requirements present major challenges for deployment. A common pruning approach applies a Taylor expansion of the loss function to estimate neuron importance. However, because it relies on the one-hot cross-entropy loss, it assesses importance narrowly, based only on the probability assigned to the single predicted next token, and thereby ignores the other potential predictions of the original model. An intuitive remedy is to use a self-distillation criterion for importance evaluation, but this introduces significant computational overhead because it requires a separate teacher model for supervision. To this end, we propose a simple but effective criterion, the information entropy of the model's output distribution, to efficiently evaluate the importance scores of neurons under Taylor pruning without requiring an additional teacher. Compared to the plain cross-entropy criterion, it provides a more holistic criterion for Taylor pruning, removing the neurons with the least impact on the model's predictions in a global manner and thereby preserving the fidelity of the model's predictive capabilities. Experimental results on extensive zero-shot benchmarks demonstrate that our method consistently outperforms existing pruning methods across the LLaMA and Qwen series models. The source code and trained weights are available at https://github.com/visresearch/HFPrune.
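To make the entropy criterion concrete, the following is a minimal sketch on a toy two-layer network in plain Python. It is not the HFPrune implementation: the network, the helper names (`hidden`, `logits_from_hidden`, `entropy_taylor_importance`), and the per-neuron score |h_j · ∂H/∂h_j| (a common first-order Taylor form) are assumptions chosen for illustration. The point it shows is that the output entropy H depends on the whole predicted distribution and needs neither a label nor a teacher model.

```python
import math

def softmax(z):
    # Numerically stable softmax over a list of logits.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def entropy(p):
    # Shannon entropy H(p) = -sum_i p_i log p_i of the output distribution.
    return -sum(v * math.log(v) for v in p if v > 0.0)

def hidden(x, W1):
    # Hidden activations h = relu(W1 x); each row of W1 is one neuron.
    return [max(0.0, sum(w * a for w, a in zip(row, x))) for row in W1]

def logits_from_hidden(h, W2):
    # Output logits z = W2 h; each row of W2 is one output class.
    return [sum(w * a for w, a in zip(row, h)) for row in W2]

def entropy_taylor_importance(x, W1, W2):
    # Hypothetical scoring rule (assumption): first-order Taylor estimate of
    # how much H changes when hidden neuron j is zeroed, |h_j * dH/dh_j|.
    h = hidden(x, W1)
    p = softmax(logits_from_hidden(h, W2))
    H = entropy(p)
    # Analytic gradient of the entropy w.r.t. logits: dH/dz_k = -p_k (log p_k + H).
    dH_dz = [-pk * (math.log(pk) + H) for pk in p]
    # Chain rule through z = W2 h.
    dH_dh = [sum(W2[k][j] * dH_dz[k] for k in range(len(W2)))
             for j in range(len(h))]
    return [abs(hj * g) for hj, g in zip(h, dH_dh)]

# Toy weights and input (illustrative values only).
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
W2 = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [0.5, 0.5, 0.0]]
x = [1.0, 2.0]

scores = entropy_taylor_importance(x, W1, W2)
# Neurons with the smallest scores would be pruned first.
prune_order = sorted(range(len(scores)), key=scores.__getitem__)
print(scores, prune_order)
```

In a realistic setting the scores would be accumulated over a calibration set and the gradient obtained by automatic differentiation rather than by hand; those details are left out of this sketch.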