Large language models (LLMs) have grown significantly in scale, creating a critical need for efficient model pruning techniques. Existing post-training pruning methods primarily measure weight importance on converged dense models to determine which weights to retain. However, they often overlook changes in weight importance during the pruning process itself, which can degrade the performance of the pruned models. To address this issue, we present LLM-Barber (Block-Aware Rebuilder for Sparsity Mask in One-Shot), a novel one-shot pruning framework that rebuilds the sparsity mask of pruned models without any retraining or weight reconstruction. LLM-Barber incorporates block-aware error optimization across Self-Attention and MLP blocks, ensuring globally optimized performance. Inspired by the recent discovery of prominent outliers in LLMs, LLM-Barber introduces an innovative pruning metric that identifies weight importance as the product of weights and gradients. Our experiments show that LLM-Barber can efficiently prune models from the LLaMA and OPT families with 7B to 13B parameters on a single A100 GPU in just 30 minutes, achieving state-of-the-art results in both perplexity and zero-shot performance across various language benchmarks. Code is available at https://github.com/YupengSu/LLM-Barber.
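To make the weight-times-gradient metric concrete, below is a minimal PyTorch sketch of scoring each weight by |W ⊙ ∂L/∂W| on a calibration loss and building a binary sparsity mask at a target sparsity. The function name `saliency_mask` and the stand-in calibration loss are illustrative assumptions, not the authors' implementation, and the sketch omits LLM-Barber's block-aware mask rebuilding across Self-Attention and MLP blocks.

```python
import torch
import torch.nn as nn

def saliency_mask(layer: nn.Linear, loss: torch.Tensor,
                  sparsity: float = 0.5) -> torch.Tensor:
    """Return a binary mask keeping the top-(1 - sparsity) fraction of
    weights, ranked by |weight * gradient| (hypothetical helper)."""
    # Gradient of the calibration loss w.r.t. this layer's weights.
    (grad,) = torch.autograd.grad(loss, layer.weight, retain_graph=True)
    score = (layer.weight * grad).abs()       # importance = |W ⊙ ∂L/∂W|
    k = int(score.numel() * sparsity)         # number of weights to prune
    threshold = score.flatten().kthvalue(k).values
    return (score > threshold).float()        # 1 = keep, 0 = prune

# Usage: run a forward pass on calibration data, form a loss, then
# zero out the pruned weights in place.
layer = nn.Linear(16, 16)
x = torch.randn(4, 16)
loss = layer(x).pow(2).mean()                 # stand-in calibration loss
mask = saliency_mask(layer, loss, sparsity=0.5)
layer.weight.data.mul_(mask)                  # apply the sparsity mask
```

One design point this sketch illustrates: because the score depends on gradients rather than weight magnitudes alone, re-evaluating it after pruning lets the mask be rebuilt as weight importance shifts, which is the motivation stated in the abstract.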