The considerable size of Large Language Models (LLMs) presents notable deployment challenges, particularly on resource-constrained hardware. Structured pruning offers an effective means of compressing LLMs, reducing storage costs and speeding up inference for more efficient utilization. In this work, we study data-efficient and resource-efficient structured pruning methods that yield smaller yet still powerful models. Knowledge distillation is well suited to pruning, since the intact model can serve as an excellent teacher for pruned students; in the LLM setting, however, it becomes challenging due to memory constraints. To address this, we propose NutePrune, an efficient progressive Numerous-teacher pruning method. NutePrune mitigates excessive memory costs by loading only one intact model and equipping it with various masks and LoRA modules, allowing it to switch seamlessly between teacher and student roles. This design lets us leverage numerous teachers of varying capacities to progressively guide the pruned model, improving overall performance. Extensive experiments across a range of tasks demonstrate the effectiveness of NutePrune: in LLaMA-7B zero-shot experiments, it retains 97.17% of the original model's performance at 20% sparsity and 95.07% at 25% sparsity. Our code is available at https://github.com/Lucius-lsr/NutePrune.
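The single-model teacher/student switching described above can be illustrated with a minimal sketch. This is our own simplified illustration, not the authors' implementation: a shared weight matrix acts as the intact teacher when the mask and LoRA modules are bypassed, and as the pruned student when they are applied, so only one copy of the pretrained weights ever resides in memory. All names (`SharedLinear`, `role`, the rank-2 adapters) are hypothetical.

```python
import numpy as np

class SharedLinear:
    """Illustrative sketch (not the NutePrune codebase): one frozen weight
    matrix shared between teacher and student roles via a pruning mask
    and a low-rank (LoRA-style) correction."""

    def __init__(self, w, rank=2):
        self.w = w                                # intact pretrained weights (frozen)
        self.mask = np.ones_like(w)               # structured pruning mask (learned)
        self.lora_a = np.zeros((w.shape[0], rank))  # low-rank adapter factors
        self.lora_b = np.zeros((rank, w.shape[1]))

    def forward(self, x, role="student"):
        if role == "teacher":
            # Intact model: mask and LoRA are bypassed, no extra weight copy needed.
            return x @ self.w
        # Student: masked weights plus a low-rank correction.
        return x @ (self.w * self.mask + self.lora_a @ self.lora_b)
```

With a full mask and zero-initialized LoRA, the student output matches the teacher exactly; as mask sparsity grows during progressive pruning, intermediate (less sparse) mask/LoRA configurations of the same shared weights can serve as the "numerous teachers" of varying capacity.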