Large Language Models (LLMs) achieve strong performance across diverse tasks but face deployment challenges due to their massive size. Structured pruning offers acceleration benefits but typically causes significant performance degradation. Recent PCA-based pruning methods alleviate this issue by retaining the principal components of activations, but they can only be applied between modules so that the transformation matrices can be fused; this introduces extra parameters and, because of residual connections, severely disrupts activation distributions. To address these issues, we propose IntraSlice, a framework for block-wise intra-module PCA compression pruning. By exploiting the structural characteristics of Transformer modules, we design an approximate PCA method whose transformation matrices can be fully fused into the model without introducing additional parameters. Building on conventional module-importance scores, we further introduce a PCA-based global pruning-ratio estimator that accounts for the distribution of compressed activations. We validate our method on the Llama2, Llama3, and Phi model families across a range of language benchmarks. Experimental results demonstrate that our approach achieves superior compression performance compared to recent baselines at the same compression ratio or inference speed.
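To make the core idea concrete, the following is a minimal NumPy sketch of how PCA-based compression can be fused into a pair of linear layers without leaving extra parameters behind. This is an illustrative toy, not the paper's IntraSlice algorithm: the function name `pca_compress_pair`, the calibration setup, and the assumption of a purely linear path between the two weight matrices (no nonlinearity or residual connection in between) are all ours.

```python
import numpy as np

def pca_compress_pair(W1, W2, X, k):
    """Illustrative sketch (not the paper's method): compress the hidden
    dimension between two linear layers by projecting intermediate
    activations onto their top-k principal components, then fusing the
    projection matrix into both weights so no extra parameters remain.

    W1: (hidden, d_in)   first layer weight
    W2: (d_out, hidden)  second layer weight
    X:  (n, d_in)        calibration inputs
    k:  number of principal components to keep
    """
    H = X @ W1.T                     # intermediate activations, (n, hidden)
    H = H - H.mean(axis=0)           # center before PCA
    cov = H.T @ H / H.shape[0]       # activation covariance, (hidden, hidden)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # top-k principal directions as columns, (hidden, k)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    W1c = U.T @ W1                   # fused first weight, (k, d_in)
    W2c = W2 @ U                     # fused second weight, (d_out, k)
    return W1c, W2c
```

The compressed pair is used as `y ≈ W2c @ (W1c @ x)`; when `k` equals the full hidden width, `U` is orthogonal and the composition is exact. In a real Transformer block, the nonlinearity and residual stream between layers are precisely what make this naive fusion break down, which motivates the approximate intra-module formulation described in the abstract.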