Recent advancements in large language models (LLMs) have significantly improved language understanding and generation capabilities. However, deploying LLMs on resource-constrained edge devices is difficult due to their high computational and storage demands. To address this issue, we propose a novel LLM pruning method, namely structurally-aware adaptive pruning (SAAP), which significantly reduces computational and memory costs while maintaining model performance. We first define an adaptive importance fusion metric that evaluates the importance of all coupled structures in LLMs by accounting for their homoscedastic uncertainty. We then rank the importance of all modules to determine which layers should be pruned to meet particular performance requirements. Furthermore, we develop a new group fine-tuning strategy to improve the inference efficiency of LLMs. Finally, we evaluate the proposed SAAP method on multiple LLMs across two common tasks, i.e., zero-shot classification and text generation. Experimental results show that SAAP outperforms several state-of-the-art baseline methods, achieving accuracy gains of 2.17%, 2.37%, and 2.39% on LLaMA-7B, Vicuna-7B, and LLaMA-13B, respectively. In addition, SAAP improves token generation speed by 5%, demonstrating its practical advantages in resource-constrained scenarios.
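To make the abstract's central idea concrete, the sketch below illustrates one plausible reading of an adaptive importance fusion metric with homoscedastic-uncertainty weighting, in the spirit of the uncertainty-based weighting of Kendall and Gal (2018). All function names, the choice of base criteria, and the exact fusion rule are illustrative assumptions for exposition, not the authors' implementation.

```python
# Minimal sketch: fuse several structure-level importance criteria for one
# linear layer, weighting each criterion by 1 / (2 * sigma_k^2), where
# sigma_k is a learnable per-criterion homoscedastic-uncertainty parameter.
# (Hypothetical example; names and criteria are assumptions, not from SAAP.)
import torch


def criterion_scores(weight: torch.Tensor) -> torch.Tensor:
    """Stack two simple per-structure importance criteria for a linear layer:
    the L2 norm of each output row and the mean absolute weight magnitude."""
    l2 = weight.norm(p=2, dim=1)             # shape: [num_structures]
    l1 = weight.abs().mean(dim=1)            # shape: [num_structures]
    return torch.stack([l2, l1], dim=0)      # shape: [num_criteria, num_structures]


def fuse_importance(scores: torch.Tensor, log_sigma: torch.Tensor) -> torch.Tensor:
    """Fuse per-criterion scores with homoscedastic-uncertainty weights
    1 / (2 * sigma_k^2): criteria with smaller learned uncertainty (sigma)
    contribute more to the fused importance."""
    # Standardize each criterion so the scores are on a comparable scale.
    normed = (scores - scores.mean(dim=1, keepdim=True)) / (
        scores.std(dim=1, keepdim=True) + 1e-8
    )
    # sigma_k = exp(log_sigma_k), so 1 / (2 * sigma_k^2) = 0.5 * exp(-2 * log_sigma_k).
    weights = 0.5 * torch.exp(-2.0 * log_sigma)
    return (weights.unsqueeze(1) * normed).sum(dim=0)  # shape: [num_structures]


# Toy usage: rank the output rows (coupled structures) of one layer for pruning.
torch.manual_seed(0)
layer_weight = torch.randn(8, 16)               # 8 structures, 16 inputs
log_sigma = torch.zeros(2, requires_grad=True)  # learnable, one per criterion
fused = fuse_importance(criterion_scores(layer_weight), log_sigma)
prune_order = torch.argsort(fused)              # least important structures first
print(prune_order)
```

In such a scheme, the log-uncertainty parameters would be learned jointly with a small calibration objective, so the fusion adapts per model rather than relying on fixed hand-tuned weights; the least important structures under the fused score are then the pruning candidates.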