Existing pruning techniques for large language models (LLMs) targeting domain-specific applications typically follow a two-stage process: pruning the pretrained general-purpose LLM and then fine-tuning the pruned LLM on the target domain. However, the pruning decisions, derived from the pretrained weights, remain unchanged during fine-tuning, even though the weights are being updated. Such a combination of fixed pruning decisions and fine-tuned weights may therefore be suboptimal, leading to non-negligible performance degradation. To address this limitation, we propose ATP: All-in-One Tuning and Structural Pruning, a unified one-stage structural pruning and fine-tuning approach that dynamically identifies the current optimal substructure throughout the fine-tuning phase via a trainable pruning decision generator. Moreover, given the limited data available for domain-specific applications, Low-Rank Adaptation (LoRA) has become a common technique for fine-tuning LLMs. In ATP, we introduce a LoRA-aware forward pass and sparsity regularization to ensure that the substructures corresponding to the learned pruning decisions can be directly removed after the ATP process. ATP outperforms state-of-the-art two-stage pruning methods on tasks in the legal and healthcare domains. More specifically, ATP recovers up to 88% and 91% of the dense model's performance when pruning 40% of the parameters of LLaMA2-7B and LLaMA3-8B, respectively.
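To make the "LoRA-aware forward" idea concrete, the following is a minimal numpy sketch (not the authors' implementation; all sizes, names, and the fixed mask `m` are illustrative assumptions): a binary pruning decision masks output channels of the merged weight `W + B @ A`, so that once training ends, the masked rows can be physically deleted without changing the surviving outputs.

```python
import numpy as np

# Hedged sketch of a LoRA-aware masked forward pass. W is the frozen
# pretrained weight; A and B are the trainable LoRA factors; m is a
# binary pruning decision over output channels (here fixed by hand,
# whereas ATP learns it with a trainable generator during fine-tuning).

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 8, 2                      # toy sizes; r is the LoRA rank

W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
B = rng.standard_normal((d_out, r)) * 0.01    # LoRA factors (trainable)
A = rng.standard_normal((r, d_in)) * 0.01
m = np.array([1, 1, 0, 1, 0, 1])              # illustrative pruning decision

x = rng.standard_normal(d_in)

# Mask the *merged* weight, so a pruned channel's LoRA update also
# contributes nothing -- this is what makes the forward "LoRA-aware".
y_masked = (m[:, None] * (W + B @ A)) @ x

# After training, pruned rows are removed outright; the remaining
# outputs must match the surviving entries of the masked forward.
keep = m.astype(bool)
W_pruned = (W + B @ A)[keep]
y_pruned = W_pruned @ x

assert np.allclose(y_masked[keep], y_pruned)
assert np.allclose(y_masked[~keep], 0.0)
```

A sparsity regularizer (e.g. a penalty pushing entries of a relaxed, real-valued `m` toward zero) would then drive the learned decisions toward the target 40% pruning ratio during fine-tuning.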