Large language models (LLMs) have become crucial for many generative downstream tasks, making their efficient deployment on resource-constrained devices both an inevitable trend and a significant challenge. Structured pruning is a widely used approach to this challenge. However, when handling the complex structure of stacked decoder layers, existing methods typically apply a single, generic importance estimation for pruning, which degrades accuracy on specific downstream tasks. In this paper, we introduce a simple yet efficient method that adaptively models the importance of each substructure and adaptively fuses coarse-grained and fine-grained estimations based on the outputs of the complex, multilayer structure. All aspects of our design integrate seamlessly into the end-to-end pruning framework. Our experimental results, compared with state-of-the-art methods on mainstream datasets, demonstrate average accuracy improvements of 1.1%, 1.02%, 2.0%, and 1.2% for LLaMa-7B, Vicuna-7B, Baichuan-7B, and Bloom-7b1, respectively.
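To make the fusion idea concrete, below is a minimal sketch of one plausible way to combine a coarse-grained (substructure-level) importance score with fine-grained (per-weight) scores through a learnable gate. The abstract does not specify the actual formulation; the function `fused_importance`, the sigmoid gate, and all variable names here are hypothetical illustrations, not the paper's method.

```python
import torch

def fused_importance(coarse: torch.Tensor, fine: torch.Tensor,
                     alpha: torch.Tensor) -> torch.Tensor:
    """Hypothetical fusion: blend a coarse-grained substructure score with
    the mean of its fine-grained element scores via a sigmoid gate."""
    gate = torch.sigmoid(alpha)              # adaptive weight in (0, 1)
    return gate * coarse + (1.0 - gate) * fine.mean()

# Toy usage: one attention head with 64 per-weight saliency scores.
coarse = torch.tensor(0.8)                   # e.g., head-level saliency
fine = torch.rand(64)                        # e.g., per-weight saliencies
alpha = torch.zeros(1, requires_grad=True)   # gate learned during pruning
score = fused_importance(coarse, fine, alpha)
# Substructures with the lowest fused scores would be pruned first.
```

In such a scheme, learning `alpha` per substructure would let the pruner decide, layer by layer, whether coarse or fine evidence is more reliable.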