Structured pruning fundamentally reduces the computational and memory overhead of large language models (LLMs) and offers a feasible path to on-device LLM deployment. Structurally pruned models remain dense and high-precision, making them highly compatible with further tuning and compression. However, because coarse-grained structured pruning inflicts substantial damage on the highly interconnected model, achieving a high compression ratio for scaled-up LLMs remains a challenge. In this paper, we introduce a task-agnostic structured pruning approach coupled with a compact Transformer architecture design. The proposed approach, named TransAct, reduces the transitional activations inside the multi-head attention (MHA) and multi-layer perceptron (MLP) modules while preserving the inter-module activations, which are sensitive to perturbation. The LLM is thereby pruned into an intra-module low-rank architecture, significantly reducing the weights, the KV cache, and attention computation. TransAct is implemented on the LLaMA model and evaluated on downstream benchmarks. The results verify the optimality of our approach at high compression with respect to both efficiency and performance. Furthermore, ablation studies reveal the strength of activation-guided iterative pruning and provide an experimental analysis of the redundancy of the MHA and MLP modules.
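The core idea of pruning transitional activations while leaving inter-module activations intact can be sketched as follows. This is a minimal illustration only, not the TransAct implementation: the scoring criterion (mean absolute activation magnitude), the function name, and the use of a plain NumPy MLP are all simplifying assumptions.

```python
import numpy as np

def prune_mlp_by_activation(w_up, w_down, acts, keep_ratio=0.25):
    """Prune the transitional (intermediate) dimension of a two-layer MLP.

    w_up:   [d_model, d_inter] up-projection weights
    w_down: [d_inter, d_model] down-projection weights
    acts:   [n_tokens, d_inter] intermediate activations from calibration data

    NOTE: a hypothetical sketch; TransAct's actual importance criterion
    and iterative schedule are described in the paper, not reproduced here.
    """
    # Score each transitional channel by its mean absolute activation.
    scores = np.abs(acts).mean(axis=0)
    k = max(1, int(keep_ratio * w_up.shape[1]))
    keep = np.sort(np.argsort(scores)[-k:])  # channels to keep
    # Slice both projections along the intermediate axis only;
    # the inter-module (d_model) dimension is left untouched.
    return w_up[:, keep], w_down[keep, :]

rng = np.random.default_rng(0)
w_up = rng.normal(size=(8, 32))
w_down = rng.normal(size=(32, 8))
acts = rng.normal(size=(100, 32))
pu, pd = prune_mlp_by_activation(w_up, w_down, acts, keep_ratio=0.25)
print(pu.shape, pd.shape)  # the intermediate width shrinks from 32 to 8
```

Because only the intra-module dimension shrinks, the pruned layers still compose with the surrounding residual stream without any reshaping of neighboring modules; the same pattern applied to the per-head dimensions of MHA is what reduces the KV cache.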