Scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks. However, this scaling also introduces redundant structures, posing challenges for real-world deployment. Despite some recognition of redundancy in LLMs, the variability of redundancy across different structures, such as MLP and Attention layers, is under-explored. In this work, we investigate the varying redundancy across different modules within Transformers, including Blocks, MLP, and Attention layers, using a similarity-based metric. This metric operates on the premise that redundant structures produce outputs highly similar to their inputs. Surprisingly, while attention layers are essential for Transformers and distinguish them from other mainstream architectures, we find that a large proportion of attention layers exhibit excessively high similarity and can be safely pruned without degrading performance, leading to reduced memory and computation costs. Additionally, we propose a method that jointly drops Attention and MLP layers, achieving better performance at higher dropping ratios. Extensive experiments demonstrate the effectiveness of our methods; e.g., Llama-3-70B maintains comparable performance even after pruning half of its attention layers. Our findings provide valuable insights for future network architecture design. The code will be released at: \url{https://github.com/Shwai-He/LLM-Drop}.
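The similarity-based criterion described above can be illustrated with a toy sketch: compare a module's input and output hidden states via cosine similarity, averaged over tokens, and treat values near 1 as a sign of redundancy. This is a minimal illustration of the premise, not the paper's exact implementation; the helper name `module_redundancy` and the toy shapes are assumptions.

```python
import numpy as np

def module_redundancy(x, y, eps=1e-8):
    """Average per-token cosine similarity between a module's input x
    and output y (shape: [tokens, hidden]); values near 1 suggest the
    module barely transforms its input and may be safe to drop."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    num = np.sum(x * y, axis=-1)
    denom = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1) + eps
    return float(np.mean(num / denom))

# Toy example: an output nearly identical to its input scores close to 1,
# while an unrelated output scores near 0.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 16))                       # 4 tokens, hidden size 16
out_redundant = h + 1e-3 * rng.normal(size=h.shape)
out_useful = rng.normal(size=h.shape)
print(module_redundancy(h, out_redundant))         # close to 1.0
print(module_redundancy(h, out_useful))            # close to 0.0
```

In practice one would hook each candidate Attention or MLP layer, score it on calibration data, and drop the highest-similarity layers first.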