Large Language Models (LLMs) achieve strong performance across diverse tasks but face deployment challenges due to their massive size. Structured pruning offers acceleration benefits but typically causes significant performance degradation. Recent PCA-based pruning methods alleviate this issue by retaining the principal components of activations, but they can only be applied between modules so that the transformation matrices can be fused; this introduces extra parameters and, because of residual connections, severely disrupts activation distributions. To address these issues, we propose IntraSlice, a framework for block-wise intra-module PCA compression pruning. By exploiting the structural characteristics of Transformer modules, we design an approximate PCA method whose transformation matrices can be fully fused into the model without introducing additional parameters. Building on conventional module-importance scores, we further introduce a PCA-based global pruning-ratio estimator that accounts for the distribution of compressed activations. We validate our method on the Llama2, Llama3, and Phi model families across a range of language benchmarks. Experimental results demonstrate that our approach achieves superior compression performance compared to recent baselines at the same compression ratio or inference speed.
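To make the core idea concrete, the following is a minimal NumPy sketch of how PCA-based compression can be fused into a pair of linear layers without leaving extra parameters behind. This is an illustrative toy, not the paper's IntraSlice algorithm: the function name `pca_compress_pair`, the calibration setup, and the assumption of a purely linear path between the two weight matrices (no nonlinearity or residual connection in between) are all ours.

```python
import numpy as np

def pca_compress_pair(W1, W2, X, k):
    """Illustrative sketch (not the paper's method): compress the hidden
    dimension between two linear layers by projecting intermediate
    activations onto their top-k principal components, then fusing the
    projection matrix into both weights so no extra parameters remain.

    W1: (hidden, d_in)   first layer weight
    W2: (d_out, hidden)  second layer weight
    X:  (n, d_in)        calibration inputs
    k:  number of principal components to keep
    """
    H = X @ W1.T                     # intermediate activations, (n, hidden)
    H = H - H.mean(axis=0)           # center before PCA
    cov = H.T @ H / H.shape[0]       # activation covariance, (hidden, hidden)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # top-k principal directions as columns, (hidden, k)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    W1c = U.T @ W1                   # fused first weight, (k, d_in)
    W2c = W2 @ U                     # fused second weight, (d_out, k)
    return W1c, W2c
```

The compressed pair is used as `y ≈ W2c @ (W1c @ x)`; when `k` equals the full hidden width, `U` is orthogonal and the composition is exact. In a real Transformer block, the nonlinearity and residual stream between layers are precisely what make this naive fusion break down, which motivates the approximate intra-module formulation described in the abstract.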