Network pruning is a promising way to address the heavy computing-resource demands of deploying and running inference on Large Language Models (LLMs). For LLMs, it is important that a pruning method be retraining-free. However, almost all existing retraining-free pruning approaches for LLMs focus on unstructured pruning, which requires specific hardware support for acceleration. In this paper, we propose a novel retraining-free structured pruning framework for LLMs, named FLAP (FLuctuation-based Adaptive Structured Pruning). It is hardware-friendly: it reduces storage and improves inference speed. For effective structured pruning of LLMs, we highlight three critical elements: formulating structured importance metrics, adaptively searching for the global compressed model structure, and implementing compensation mechanisms to mitigate performance loss. First, FLAP uses a fluctuation-based pruning metric to determine whether the output feature map is easily recoverable when a column of weights is removed. Then, it standardizes the importance scores to adaptively determine the global compressed model structure. Finally, FLAP adds bias terms to recover the output feature maps using the baseline values. We thoroughly evaluate our approach on a variety of language benchmarks. Without any retraining, our method significantly outperforms state-of-the-art methods, including LLM-Pruner and the extension of Wanda to structured pruning. The code is released at https://github.com/CASIA-IVA-Lab/FLAP.
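The three steps described above can be illustrated with a minimal NumPy sketch for plain linear layers. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration: the score of an input channel is taken to be its sample variance over calibration data times the squared norm of the matching weight column, the "baseline value" is taken to be the channel mean, and the function names (`fluctuation_scores`, `global_keep_masks`, `prune_with_bias`) are hypothetical and not taken from the released code.

```python
import numpy as np

def fluctuation_scores(X, W):
    """Score each input channel of a linear layer y = W @ x + b.

    X: (N, d_in) calibration inputs, W: (d_out, d_in) weights.
    Assumed metric: sample variance of the channel times the squared
    norm of its weight column. A channel that barely fluctuates can be
    replaced by a constant, so it gets a low score.
    """
    var = X.var(axis=0, ddof=1)            # fluctuation around the baseline
    col_norm_sq = (W ** 2).sum(axis=0)     # weight-column magnitude
    return var * col_norm_sq

def global_keep_masks(score_list, sparsity):
    """Standardize scores per layer, then apply one global threshold."""
    z = [(s - s.mean()) / (s.std() + 1e-8) for s in score_list]
    all_z = np.concatenate(z)
    k = min(int(round(sparsity * all_z.size)), all_z.size - 1)
    if k == 0:
        return [np.ones(zi.shape, dtype=bool) for zi in z]
    # Ties at the threshold are kept, so fewer than k columns may be pruned.
    threshold = np.sort(all_z)[k]
    return [zi >= threshold for zi in z]

def prune_with_bias(W, b, X, keep):
    """Drop pruned columns and fold their baseline contribution into the bias."""
    baseline = X.mean(axis=0)              # assumed baseline: channel mean
    pruned = ~keep
    bias_fix = W[:, pruned] @ baseline[pruned]
    return W[:, keep], b + bias_fix
```

Under these assumptions, the layer output is reproduced exactly when the pruned channels are constant, and only approximately otherwise, with an error that grows with the channel's fluctuation. Standardizing per layer before thresholding is what lets a single global threshold assign different pruning ratios to different layers.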