Overparametrized transformer networks are the state-of-the-art architecture for Large Language Models (LLMs). However, such models contain billions of parameters, which makes large-scale compute a necessity and raises environmental concerns. To address these issues, we propose FinerCut, a new form of fine-grained layer pruning which, in contrast to prior work at the transformer block level, considers all self-attention and feed-forward network (FFN) layers within blocks as individual pruning candidates. FinerCut prunes layers whose removal causes minimal alteration to the model's output, yielding a new, lean, interpretable, and task-agnostic pruning method. Tested across 9 benchmarks, our approach retains 90% of the performance of Llama3-8B with 25% of layers removed, and 95% of the performance of Llama3-70B with 30% of layers removed, all without fine-tuning or post-pruning reconstruction. Strikingly, 42% (34 out of 80) of the self-attention layers in Llama3-70B can be removed while preserving 99% of its performance, again without any fine-tuning after removal. Moreover, FinerCut provides a tool to inspect the types and locations of pruned layers, which makes it possible to observe interesting pruning behaviors. For instance, we observe a preference for pruning self-attention layers, often in consecutive deeper decoder layers. We hope our insights inspire future efficient LLM architecture designs.
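The pruning criterion described in the abstract (treat every self-attention and FFN layer as its own candidate, and remove the one whose removal alters the model's output least) can be illustrated with a toy sketch. This is not the FinerCut implementation. The greedy one-layer-at-a-time search, the mean squared output distance, and all names (`make_sublayer`, `forward`, `output_shift`, `greedy_prune`) are assumptions made for illustration, and the "model" is a small random residual stack rather than an LLM. The two deepest attention sublayers are deliberately given near-zero scale, so the result reflects how the toy was built, not a reproduced finding.

```python
import math
import random

DIM = 4  # width of the toy residual stream (assumption)


def make_sublayer(scale, seed, dim=DIM):
    """Toy residual sublayer: x -> scale * tanh(W x) with a fixed random W."""
    rng = random.Random(seed)
    w = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

    def f(x):
        return [scale * math.tanh(sum(wij * xj for wij, xj in zip(row, x)))
                for row in w]

    return f


def forward(layers, active, x):
    """Run the residual stack; a pruned sublayer is skipped (identity)."""
    h = list(x)
    for name, f in layers:
        if name in active:
            delta = f(h)
            h = [hi + di for hi, di in zip(h, delta)]
    return h


def output_shift(layers, active, inputs, reference):
    """Mean squared distance between current outputs and the unpruned ones."""
    total = 0.0
    for x, ref in zip(inputs, reference):
        out = forward(layers, active, x)
        total += sum((o - r) ** 2 for o, r in zip(out, ref)) / len(ref)
    return total / len(inputs)


def greedy_prune(layers, inputs, n_remove):
    """Repeatedly drop the sublayer whose removal changes the output least."""
    names = [name for name, _ in layers]
    active = set(names)
    reference = [forward(layers, active, x) for x in inputs]
    removed = []
    for _ in range(n_remove):
        candidates = [n for n in names if n in active]
        best = min(
            candidates,
            key=lambda n: output_shift(layers, active - {n}, inputs, reference),
        )
        active.remove(best)
        removed.append(best)
    return removed


# Four toy blocks, each with one "attention" and one "FFN" sublayer.
# Scales are hypothetical: the two deepest attention sublayers barely matter.
attn_scales = [1.0, 1.0, 1e-6, 0.0]
layers = []
for i, a_scale in enumerate(attn_scales):
    layers.append((f"attn_{i}", make_sublayer(a_scale, seed=2 * i)))
    layers.append((f"ffn_{i}", make_sublayer(1.0, seed=2 * i + 1)))

rng = random.Random(0)
inputs = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(8)]

removed = greedy_prune(layers, inputs, n_remove=2)
print("pruned sublayers, in order:", removed)
```

Because candidates are individual sublayers rather than whole blocks, the search can drop an attention sublayer while keeping the FFN in the same block, which is the granularity the abstract contrasts with block-level pruning.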