Large language models (LLMs) have proven to be highly effective across various natural language processing tasks. However, their large number of parameters poses significant challenges for practical deployment. Pruning, a technique aimed at reducing the size and complexity of LLMs, offers a potential solution by removing redundant components from the network. Despite the promise of pruning, existing methods often struggle to achieve substantial end-to-end LLM inference speedup. In this paper, we introduce SLEB, a novel approach designed to streamline LLMs by eliminating redundant transformer blocks. We choose the transformer block as the fundamental unit for pruning, because LLMs exhibit block-level redundancy with high similarity between the outputs of neighboring blocks. This choice allows us to effectively enhance the processing speed of LLMs. Our experimental results demonstrate that SLEB successfully accelerates LLM inference without compromising the linguistic capabilities of these models, making it a promising technique for optimizing the efficiency of LLMs. The code is available at: https://github.com/leapingjagg-dev/SLEB
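To make the notion of block-level redundancy concrete, the following is a minimal sketch of how similarity between the outputs of neighboring blocks could be measured and used to pick blocks for removal. It is an illustration under stated assumptions, not the SLEB selection procedure, which the abstract does not specify. The toy residual blocks, the cosine-similarity score, the greedy removal loop, and all function names (`make_toy_blocks`, `neighbor_similarities`, `drop_most_redundant`) are hypothetical and chosen only for this example.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Mean cosine similarity between matching rows (tokens) of a and b."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-12
    return float(np.mean(num / den))


def make_toy_blocks(num_blocks: int, dim: int, scales, seed: int = 0):
    """Hypothetical stand-ins for transformer blocks: x -> x + scale * tanh(x @ W).

    A small scale makes a block close to the identity, i.e. redundant.
    """
    rng = np.random.default_rng(seed)
    weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(num_blocks)]
    return [
        (lambda x, w=w, s=s: x + s * np.tanh(x @ w))
        for w, s in zip(weights, scales)
    ]


def neighbor_similarities(blocks, x: np.ndarray):
    """Similarity between each block's output and the previous block's output."""
    sims = []
    h = x
    for block in blocks:
        out = block(h)
        sims.append(cosine_similarity(h, out))
        h = out
    return sims


def drop_most_redundant(blocks, x: np.ndarray, num_to_drop: int):
    """Greedily remove the block whose output is most similar to its input.

    Scores are recomputed after every removal, because removing one block
    changes the inputs seen by all later blocks. Returns the kept indices.
    """
    kept = list(range(len(blocks)))
    for _ in range(num_to_drop):
        sims = neighbor_similarities([blocks[i] for i in kept], x)
        kept.pop(int(np.argmax(sims)))
    return kept


if __name__ == "__main__":
    # Assumed toy setup: 32 tokens, hidden size 16, six blocks.
    # Blocks 1 and 3 are nearly identity mappings by construction.
    x = np.random.default_rng(1).standard_normal((32, 16))
    blocks = make_toy_blocks(6, 16, scales=[1.0, 0.01, 1.0, 0.02, 1.0, 1.0])
    print("neighbor similarities:", np.round(neighbor_similarities(blocks, x), 4))
    print("kept block indices:", drop_most_redundant(blocks, x, num_to_drop=2))
```

In this toy setup the two near-identity blocks (indices 1 and 3) score highest and are the ones removed. Because whole blocks are dropped, the remaining model simply runs fewer sequential layers, which is why block-level pruning can translate into end-to-end speedup without specialized sparse kernels.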