Large language models (LLMs) have proven to be highly effective across various natural language processing tasks. However, their large number of parameters poses significant challenges for practical deployment. Pruning, a technique aimed at reducing the size and complexity of LLMs, offers a potential solution by removing redundant components from the network. Despite the promise of pruning, existing methods often struggle to achieve substantial end-to-end LLM inference speedup. In this paper, we introduce SLEB, a novel approach designed to streamline LLMs by eliminating redundant transformer blocks. We choose the transformer block as the fundamental unit for pruning because LLMs exhibit block-level redundancy, with high similarity between the outputs of neighboring blocks. This choice allows us to effectively enhance the processing speed of LLMs. Our experimental results demonstrate that SLEB outperforms previous LLM pruning methods in accelerating LLM inference while maintaining superior perplexity and accuracy, making SLEB a promising technique for enhancing the efficiency of LLMs. The code is available at: https://github.com/jiwonsong-dev/SLEB.
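To make the block-level redundancy argument concrete, the following is a minimal sketch of one way such redundancy could be scored and exploited: each transformer block is rated by the cosine similarity between its input and output hidden states, and the most redundant blocks are removed iteratively. This is an illustrative assumption, not the paper's exact criterion; all names (score_blocks, prune_blocks, num_remove) and the toy residual blocks are hypothetical.

```python
# Hedged sketch of block-level redundancy scoring; not the authors' exact algorithm.
import torch
import torch.nn.functional as F

def score_blocks(blocks, hidden_states):
    """Score each block by the cosine similarity between its input and output.
    A block whose output closely matches its input changes little and is a
    candidate for removal."""
    scores = []
    x = hidden_states
    for block in blocks:
        y = block(x)
        sim = F.cosine_similarity(x.flatten(1), y.flatten(1), dim=-1).mean()
        scores.append(sim.item())
        x = y
    return scores

def prune_blocks(blocks, hidden_states, num_remove):
    """Iteratively drop the block with the highest input/output similarity,
    re-scoring after each removal since downstream blocks then see new inputs."""
    blocks = list(blocks)
    for _ in range(num_remove):
        scores = score_blocks(blocks, hidden_states)
        blocks.pop(max(range(len(blocks)), key=scores.__getitem__))
    return blocks

if __name__ == "__main__":
    torch.manual_seed(0)
    # Stand-in "transformer blocks": small residual MLPs over a toy hidden size.
    hidden = 16
    mlps = [torch.nn.Sequential(torch.nn.Linear(hidden, hidden), torch.nn.GELU(),
                                torch.nn.Linear(hidden, hidden))
            for _ in range(6)]
    blocks = [lambda x, m=m: x + m(x) for m in mlps]  # residual connection
    calib = torch.randn(4, hidden)                     # tiny calibration batch
    kept = prune_blocks(blocks, calib, num_remove=2)
    print(f"kept {len(kept)} of 6 blocks")
```

Because whole blocks are removed rather than individual weights, the pruned model keeps dense matrices and therefore translates directly into end-to-end inference speedup, which is the property the abstract emphasizes.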