As Large Language Models (LLMs) continue to advance in performance, their size has grown significantly, with current LLMs containing billions or even trillions of parameters. In this study, however, we find that many layers of LLMs are highly similar to one another, and that some layers contribute negligibly to network functionality. Based on this observation, we define a metric called Block Influence (BI) to gauge the significance of each layer in LLMs. We then propose a straightforward pruning approach, layer removal, in which we directly delete redundant layers based on their BI scores. Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) model pruning methods. Moreover, ShortGPT is orthogonal to quantization-like methods, so the two can be combined to further reduce parameters and computation. That simple layer removal achieves better results than more complex pruning techniques suggests a high degree of redundancy in the model architecture.
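The abstract does not spell out how BI is computed, so the following is only a minimal sketch of BI-based layer removal under a stated assumption: that a layer's BI is one minus the average cosine similarity between its input and output hidden states, so a layer that barely changes its input scores close to zero. The function names (`block_influence`, `layers_to_remove`, `prune`) and the toy hidden states are hypothetical and chosen for illustration only.

```python
import numpy as np

def block_influence(hidden_states):
    """Score each layer by how much it changes its input.

    hidden_states: list of length n_layers + 1; hidden_states[i] is the
    input to layer i and hidden_states[i + 1] its output, each shaped
    (n_tokens, d_model).
    ASSUMPTION: BI = 1 - mean cosine similarity between input and output.
    """
    scores = []
    for x_in, x_out in zip(hidden_states[:-1], hidden_states[1:]):
        dot = np.sum(x_in * x_out, axis=-1)
        norms = np.linalg.norm(x_in, axis=-1) * np.linalg.norm(x_out, axis=-1)
        cos = dot / np.maximum(norms, 1e-12)  # guard against zero vectors
        scores.append(1.0 - float(np.mean(cos)))
    return np.array(scores)

def layers_to_remove(bi_scores, n_remove):
    """Return the indices of the n_remove layers with the lowest BI."""
    order = np.argsort(bi_scores, kind="stable")
    return sorted(int(i) for i in order[:n_remove])

def prune(layers, remove_idx):
    """Drop the selected layers, keeping the rest in their original order."""
    drop = set(remove_idx)
    return [layer for i, layer in enumerate(layers) if i not in drop]

# Toy example: layer 0 rotates its input by 90 degrees (large change),
# layer 1 only rescales it (no change in direction).
h0 = np.array([[1.0, 0.0], [0.0, 1.0]])
h1 = np.array([[0.0, 1.0], [-1.0, 0.0]])
h2 = 2.0 * h1
bi = block_influence([h0, h1, h2])
kept = prune(["layer0", "layer1"], layers_to_remove(bi, n_remove=1))
```

In this toy case the rescaling layer gets the lowest score and is the one removed. In practice the hidden states would be collected from a calibration set run through the real model, and the number of layers to remove is a separate choice not fixed by this sketch.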