Large pre-trained models (LPMs), such as LLaMA and GLM, have shown exceptional performance across various tasks through fine-tuning. Although low-rank adaptation (LoRA) has emerged as a cheap way to fine-tune these LPMs on downstream tasks, deploying them is still hindered by their vast scale and computational cost. Neural network pruning offers a way to compress LPMs. However, current pruning methods designed for LPMs are not compatible with LoRA: they either apply unstructured pruning, which prevents the LoRA weights from being merged, or rely on the gradients of the pre-trained weights to guide pruning, which imposes significant memory overhead. To this end, we propose LoRAPrune, a new framework that delivers an accurate, compact model for efficient inference in a highly memory-efficient manner. Specifically, we first design a LoRA-guided pruning criterion, which uses the weights and gradients of LoRA, rather than the gradients of the pre-trained weights, for importance estimation. We then propose a structured iterative pruning procedure to remove redundant channels and heads. Extensive experimental results demonstrate the superior performance of LoRAPrune over existing approaches on the LLaMA series of models. For instance, at a 50% compression rate, LoRAPrune achieves perplexity that is lower than that of LLM-Pruner by 8.0 on WikiText2 and by 16.05 on PTB, while also reducing memory usage by 52.6%. The code will be released after review.
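The idea of a LoRA-guided criterion followed by structured pruning can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption for illustration only: the function names `lora_guided_importance` and `channel_mask` are hypothetical, the gradient of the pre-trained weight is replaced by an assumed proxy built from the LoRA factors (the first-order change of `B @ A`), and importance is taken as a first-order Taylor-style score. It is not the authors' implementation.

```python
import numpy as np


def lora_guided_importance(W, A, B, grad_A, grad_B):
    """Element-wise importance that never touches dL/dW.

    Shapes (assumed): W is (out, in), B is (out, r), A is (r, in);
    grad_A and grad_B have the shapes of A and B.
    """
    # Assumed proxy for the gradient of the merged weight: the product
    # rule applied to the low-rank update B @ A.
    grad_proxy = grad_B @ A + B @ grad_A
    # Weight that would be deployed after merging LoRA into W.
    merged = W + B @ A
    # First-order Taylor-style score: |weight * gradient|.
    return np.abs(merged * grad_proxy)


def channel_mask(importance, keep_ratio):
    """Structured step: keep the output channels (rows) with the highest total score."""
    scores = importance.sum(axis=1)
    n_keep = max(1, int(round(keep_ratio * scores.size)))
    keep = np.argsort(scores)[-n_keep:]
    mask = np.zeros(scores.size, dtype=bool)
    mask[keep] = True
    return mask
```

In an iterative procedure, one would presumably recompute the scores during fine-tuning and lower `keep_ratio` gradually until the target compression rate is reached, removing whole rows (channels) so that the LoRA weights can still be merged into the pruned matrix.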