Large Language Models (LLMs) have achieved significant success across various NLP tasks. However, their massive computational costs limit their widespread use, particularly in real-time applications. Structured pruning offers an effective solution by compressing models and directly providing end-to-end speed improvements, regardless of the hardware environment. Meanwhile, different components of the model exhibit varying sensitivities to pruning, calling for non-uniform model compression. However, a pruning method should not only identify a capable substructure, but also account for post-compression training. To this end, we propose DarwinLM, a method for training-aware structured pruning. DarwinLM builds upon an evolutionary search process: in each generation, it produces multiple offspring models through mutation and selects the fittest for survival. To assess the effect of post-training, we incorporate a lightweight, multistep training process within the offspring population, progressively increasing the number of tokens and eliminating poorly performing models at each selection stage. We validate our method through extensive experiments on Llama-2-7B, Llama-3.1-8B, and Qwen-2.5-14B-Instruct, achieving state-of-the-art performance for structured pruning. For instance, DarwinLM surpasses ShearedLlama while requiring 5x less training data during post-compression training. Code is available at: https://github.com/IST-DASLab/DarwinLM
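The search procedure described above (mutate the current model into offspring, finetune each candidate on a progressively larger token budget, and discard the weakest at every selection stage) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `mutate`, `finetune`, and `fitness` callables, the token schedule, and the survivor counts are all hypothetical placeholders.

```python
def darwinlm_search(parent, fitness, mutate, finetune,
                    generations=10, offspring=8,
                    token_schedule=(10_000, 50_000, 100_000),
                    survivors_per_stage=(4, 2, 1)):
    """Hypothetical sketch of training-aware evolutionary pruning search.

    Each generation mutates the parent into a population of pruned
    candidates, then runs a multistep selection: every stage finetunes
    the survivors on a larger token budget and keeps only the fittest,
    so post-training behavior (not just zero-shot quality) drives
    selection. Returns the fittest model found.
    """
    for _ in range(generations):
        # Generate offspring by mutating the current parent structure.
        population = [mutate(parent) for _ in range(offspring)]
        # Progressive selection: more tokens, fewer survivors per stage.
        for tokens, keep in zip(token_schedule, survivors_per_stage):
            population = [finetune(m, tokens) for m in population]
            population.sort(key=fitness, reverse=True)
            population = population[:keep]
        parent = population[0]
    return parent
```

With a toy setup where "models" are scalars and finetuning adds a small token-proportional gain, the loop monotonically improves the surviving candidate, mirroring how staged selection filters out offspring that respond poorly to post-compression training.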