Large language models (LLMs) have become crucial for many generative downstream tasks, making their efficient deployment on resource-constrained devices both an inevitable trend and a significant challenge. Structured pruning is a widely used method for addressing this challenge. However, when handling the complex structure of the multiple decoder layers, existing methods typically apply a common, uniform estimation approach for pruning, which degrades accuracy on specific downstream tasks. In this paper, we introduce a simple yet efficient method that adaptively models the importance of each substructure and adaptively fuses coarse-grained and fine-grained estimations based on results from the complex, multi-layer structure. All aspects of our design integrate seamlessly into an end-to-end pruning framework. Compared with state-of-the-art methods on mainstream datasets, our experimental results demonstrate average accuracy improvements of 1.1%, 1.02%, 2.0%, and 1.2% for LLaMa-7B, Vicuna-7B, Baichuan-7B, and Bloom-7b1, respectively.