The colossal parameter counts and computational overhead of Large Language Models (LLMs) challenge their real-world application. Network pruning, which targets unstructured or structured sparsity by removing redundant parameters, has recently been explored for LLM acceleration. Most existing work on LLM pruning focuses on unstructured pruning, which typically requires special hardware support for a practical speed-up. In contrast, structured pruning can reduce latency on general devices. However, performing structured pruning efficiently while maintaining performance remains a challenge, especially at high sparsity ratios. To this end, we introduce an efficient structured pruning framework named CFSP, which leverages both Coarse (inter-block) and Fine-grained (intra-block) activation information as an importance criterion to guide pruning. The pruning is highly efficient, as it requires only one forward pass to compute feature activations. Specifically, we first allocate the sparsity budget across blocks based on their importance and then retain important weights within each block. In addition, we introduce a recovery fine-tuning strategy that adaptively allocates training overhead based on coarse-grained importance to further improve performance. Experimental results demonstrate that CFSP outperforms existing methods on diverse models across various sparsity budgets. Our code will be available at https://github.com/wyxscir/CFSP.
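The coarse-then-fine procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the inverse-importance allocation rule, the activation-weighted channel score, and the function names (`allocate_sparsity`, `prune_block`) are all illustrative choices introduced here, and the paper's actual importance criteria may differ.

```python
import numpy as np

def allocate_sparsity(block_importance, target_sparsity):
    """Coarse step: give less important blocks a higher sparsity ratio.

    The inverse-importance weighting is an assumed allocation rule;
    ratios are normalized so their mean matches the global budget,
    then clipped to a safe maximum.
    """
    imp = np.asarray(block_importance, dtype=float)
    inv = 1.0 / (imp + 1e-8)
    raw = inv / inv.sum() * target_sparsity * len(imp)
    return np.clip(raw, 0.0, 0.95)

def prune_block(weight, activation, sparsity):
    """Fine step: zero out the lowest-scoring output channels of one block.

    Channels are scored by |W| weighted by per-input activation
    magnitudes (an assumed criterion), so only one forward pass is
    needed to collect the activations.
    """
    scores = (np.abs(weight) * activation[None, :]).sum(axis=1)
    k = int(round(sparsity * len(scores)))
    if k == 0:
        return weight.copy()
    drop = np.argsort(scores)[:k]
    pruned = weight.copy()
    pruned[drop, :] = 0.0
    return pruned
```

For example, with block importances `[1.0, 4.0]` and a global budget of 0.5, the first (less important) block receives the larger share of the sparsity budget; within each block, channels whose activation-weighted magnitudes are smallest are then removed.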