Making large language models (LLMs) more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a promising route, but existing methods have limitations: width pruning often breaks the standard transformer layout and requires custom inference code, while depth pruning can cause abrupt accuracy drops. Moreover, while many pruning approaches are effective on LLMs, they struggle to maintain performance on small language models (SLMs). In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink the embedding and LM-head layers and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning channel importance with the post-pruning token distribution. COMPACT inherits the strengths of both depth and width pruning: deployment friendliness (it keeps a standard transformer architecture), scale adaptivity (vocabulary pruning can be traded off against FFN pruning), competitive pruning time, and strong memory savings alongside throughput gains. Experiments across the Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream performance with substantial reductions in parameters, GPU memory, and latency.
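To make the two-step idea concrete, below is a minimal, self-contained PyTorch sketch. It is an illustration under stated assumptions, not the paper's implementation: dimensions are toy-sized, the calibration "corpus" is random token ids, embedding rows stand in for real hidden states at the FFN input, and the frequency-weighted magnitude score is one plausible choice of channel importance. Names such as `w_up`, `w_down`, `vocab_keep`, and `ffn_keep` are hypothetical.

```python
import torch

torch.manual_seed(0)

# Toy dimensions (hypothetical; real models are far larger).
V, H, I = 1000, 64, 256          # vocab size, hidden size, FFN intermediate size
embed   = torch.randn(V, H)      # embedding table
lm_head = torch.randn(V, H)      # LM head (untied here for clarity)
w_up    = torch.randn(I, H)      # FFN up projection
w_down  = torch.randn(H, I)      # FFN down projection

# --- Step (i): vocabulary pruning ------------------------------------
# Count token frequencies on a calibration corpus, keep the most common
# tokens, and drop the rare rows of the embedding and LM head.
calib_ids = torch.randint(0, V, (10_000,))           # stand-in calibration ids
counts = torch.bincount(calib_ids, minlength=V).float()
vocab_keep = 600                                     # target vocab size
kept_tokens = counts.topk(vocab_keep).indices
embed_p, lm_head_p = embed[kept_tokens], lm_head[kept_tokens]

# --- Step (ii): FFN channel pruning ----------------------------------
# Score each intermediate channel by activation magnitude, weighting
# every surviving token by its post-pruning frequency, so importance
# matches the token distribution the pruned model will actually see.
freq = counts[kept_tokens] / counts[kept_tokens].sum()   # (vocab_keep,)
acts = embed_p @ w_up.T                                  # (vocab_keep, I)
channel_score = (freq.unsqueeze(1) * acts.abs()).sum(dim=0)
ffn_keep = 128                                       # target intermediate size
kept_ch = channel_score.topk(ffn_keep).indices
w_up_p, w_down_p = w_up[kept_ch], w_down[:, kept_ch]

print(embed_p.shape, lm_head_p.shape, w_up_p.shape, w_down_p.shape)
# torch.Size([600, 64]) torch.Size([600, 64]) torch.Size([128, 64]) torch.Size([64, 128])
```

Because both steps only slice rows or columns of existing weight matrices, the pruned model keeps the standard transformer layout, which is what makes the approach deployment-friendly.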