Large Language Models (LLMs), such as LLaMA and T5, have shown exceptional performance across various tasks through fine-tuning. Although low-rank adaptation (LoRA) has emerged as a cheap way to fine-tune these LLMs on downstream tasks, their deployment is still hindered by their vast scale and computational cost. Post-training pruning offers a way to compress LLMs, but current pruning methods designed for LLMs are incompatible with LoRA: they either apply unstructured pruning, which prevents the LoRA weights from being merged, or rely on the gradients of the pre-trained weights to guide pruning, which incurs significant memory overhead. To this end, we propose LoRAPrune, a new framework that delivers an accurate, structurally pruned model in a highly memory-efficient manner. Specifically, we first design a LoRA-guided pruning criterion that uses the weights and gradients of LoRA, rather than the gradients of the pre-trained weights, for importance estimation. We then integrate this criterion into an iterative pruning process that effectively removes redundant channels and attention heads. Extensive experiments demonstrate that LoRAPrune outperforms existing approaches on the LLaMA family of models. At a 50% compression rate, LoRAPrune outperforms LLM-Pruner, reducing perplexity by 4.81 on WikiText2 and 3.46 on PTB while cutting memory usage by 52.6%. LoRAPrune also matches semi-structured pruning across multiple LLMs, demonstrating its wide applicability. The code is available at https://github.com/aim-uofa/LoRAPrune.
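The core idea of the LoRA-guided criterion can be illustrated with a minimal sketch. Since the merged weight is W + BA, a surrogate for the gradient of the full weight can be built from the LoRA gradients alone (via the chain rule on the low-rank factors), so the full-weight gradient never needs to be materialized. The shapes, the squared-saliency form, and the function name below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def lora_guided_importance(W, A, B, grad_A, grad_B):
    """Approximate per-output-channel importance without gradients of W.

    The merged weight is W + B @ A. Its gradient is approximated from the
    LoRA gradients alone as grad_B @ A + B @ grad_A, so only the small
    low-rank factors and their gradients must be kept in memory.
    """
    merged = W + B @ A                     # effective weight after the LoRA merge
    approx_grad = grad_B @ A + B @ grad_A  # low-rank surrogate for dL/dW
    saliency = (merged * approx_grad) ** 2 # element-wise saliency (assumed form)
    return saliency.sum(axis=1)            # one score per output channel (row)

# Toy shapes: 4 output channels, 6 inputs, LoRA rank 2.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
B = rng.normal(size=(4, 2)); A = rng.normal(size=(2, 6))
gB = rng.normal(size=(4, 2)); gA = rng.normal(size=(2, 6))

scores = lora_guided_importance(W, A, B, gA, gB)
keep = np.argsort(scores)[scores.size // 2:]  # keep the top half of channels
```

In an iterative scheme as described in the abstract, scores like these would be accumulated over fine-tuning steps and the lowest-scoring channels or heads pruned gradually, rather than in a single shot.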