Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization and question answering. While their performance is impressive, the computational footprint due to their vast number of parameters can be prohibitive. Existing solutions such as SparseGPT and Wanda attempt to alleviate this issue through weight pruning. However, their layer-wise approach significantly perturbs the model's output and requires meticulous tuning of hyperparameters such as the pruning rate, which can adversely affect overall model performance. To address this, we introduce a novel LLM pruning technique, blockwise parameter-efficient sparsity allocation (BESA), which applies a blockwise reconstruction loss. In contrast to typical layer-wise pruning techniques, BESA has two distinctive attributes: i) it targets the overall pruning error with respect to individual transformer blocks, and ii) it allocates layer-specific sparsity in a differentiable manner, both of which reduce performance degradation after pruning. Our experiments show that BESA achieves state-of-the-art performance, efficiently pruning LLMs such as LLaMA1 and LLaMA2 with 7B to 70B parameters on a single A100 GPU in just five hours. Code is available at https://github.com/OpenGVLab/LLMPrune-BESA.
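The idea of measuring pruning error at the level of a whole block and then choosing per-layer sparsities can be illustrated with a toy sketch. This is not the BESA implementation: it assumes a hypothetical two-layer "block" with a ReLU in place of a transformer block, uses plain magnitude pruning, and swaps the differentiable sparsity allocation described above for an exhaustive grid search. All function names are illustrative, and both layers are assumed to have the same number of weights so that the mean of the two sparsities equals the overall sparsity.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise ReLU."""
    return [[max(0.0, v) for v in row] for row in m]

def magnitude_prune(w, sparsity):
    """Zero the smallest-magnitude fraction `sparsity` of the entries of w.
    Ties at the threshold are all pruned (unlikely with random floats)."""
    flat = sorted(abs(v) for row in w for v in row)
    k = int(round(sparsity * len(flat)))
    if k == 0:
        return [row[:] for row in w]
    threshold = flat[k - 1]
    return [[0.0 if abs(v) <= threshold else v for v in row] for row in w]

def block_forward(x, w1, w2):
    """Toy stand-in for a transformer block: linear, ReLU, linear."""
    return matmul(relu(matmul(x, w1)), w2)

def block_error(x, w1, w2, s1, s2):
    """Blockwise reconstruction error: squared difference between the
    dense block output and the output after pruning both layers."""
    dense = block_forward(x, w1, w2)
    pruned = block_forward(x, magnitude_prune(w1, s1), magnitude_prune(w2, s2))
    return sum((d - p) ** 2 for dr, pr in zip(dense, pruned) for d, p in zip(dr, pr))

def allocate_sparsity(x, w1, w2, target=0.5, steps=10):
    """Grid search (an illustrative substitute for learned allocation) over
    per-layer sparsities (s1, s2) whose mean equals `target`.
    Returns (error, s1, s2) for the allocation with the lowest block error."""
    best = None
    for i in range(steps + 1):
        s1 = i / steps
        s2 = 2 * target - s1
        if s2 < -1e-9 or s2 > 1 + 1e-9:
            continue  # allocation not feasible for this s1
        s2 = min(1.0, max(0.0, s2))
        err = block_error(x, w1, w2, s1, s2)
        if best is None or err < best[0]:
            best = (err, s1, s2)
    return best

# Toy calibration data and weights (seeded for reproducibility).
rng = random.Random(0)
x = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
w1 = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(8)]
w2 = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(8)]

best = allocate_sparsity(x, w1, w2, target=0.5)
uniform = block_error(x, w1, w2, 0.5, 0.5)
print("best (error, s1, s2):", best)
print("uniform 50%/50% error:", uniform)
```

Because the uniform allocation is one of the candidates in the grid, the selected allocation can never have a higher blockwise error than uniform sparsity at the same overall pruning rate. A learned, differentiable allocation pursues the same objective without enumerating candidates.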