As their size increases, Large Language Models (LLMs) are natural candidates for network pruning methods: approaches that drop a subset of network weights while striving to preserve performance. Existing methods, however, require either retraining, which is rarely affordable for billion-scale LLMs, or solving a weight reconstruction problem reliant on second-order information, which may also be computationally expensive. In this paper, we introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs. Motivated by the recent observation of emergent large-magnitude features in LLMs, our approach prunes weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis. Notably, Wanda requires no retraining or weight update, and the pruned LLM can be used as is. We conduct a thorough evaluation of our method on LLaMA across various language benchmarks. Wanda significantly outperforms the established baseline of magnitude pruning and competes favorably against recent methods involving intensive weight updates. Code is available at https://github.com/locuslab/wanda.
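The pruning criterion described above (weight magnitude multiplied by the corresponding input activation, compared within each output) can be illustrated with a minimal sketch. This is not the released implementation. The function name `wanda_style_prune`, the toy values, and the choice to summarize each input feature by its L2 norm over a small set of calibration tokens are assumptions made for illustration; the abstract itself does not specify how activations are aggregated.

```python
import math


def wanda_style_prune(weights, activations, sparsity):
    """Zero the lowest-scoring weights in each output row (illustrative sketch).

    weights:     list of rows, shape (out_features, in_features)
    activations: calibration inputs, shape (num_tokens, in_features)
    sparsity:    fraction of weights to remove per output row, in [0, 1]
    """
    in_features = len(weights[0])
    # Assumption: each input feature is summarized by its L2 norm over tokens.
    feature_norms = [
        math.sqrt(sum(x[j] ** 2 for x in activations))
        for j in range(in_features)
    ]
    num_to_prune = int(in_features * sparsity)
    pruned = []
    for row in weights:
        # Score = |weight| * norm of the input feature that weight multiplies.
        scores = [abs(w) * feature_norms[j] for j, w in enumerate(row)]
        # Rank within this output row only, not across the whole matrix.
        order = sorted(range(in_features), key=lambda j: scores[j])
        drop = set(order[:num_to_prune])
        # Surviving weights are left unchanged: no retraining, no update.
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


# Toy layer: 2 outputs, 4 inputs; input feature 2 has very large activations.
weights = [[0.5, -2.0, 0.1, 1.0],
           [3.0, 0.2, -0.4, 0.05]]
activations = [[0.1, 0.1, 10.0, 1.0],
               [0.2, 0.1, 12.0, 1.0]]
pruned = wanda_style_prune(weights, activations, sparsity=0.5)
print(pruned)  # [[0.0, 0.0, 0.1, 1.0], [3.0, 0.0, -0.4, 0.0]]
```

In the first row, plain magnitude pruning would remove the weights 0.1 and 0.5 and keep -2.0. Under the activation-weighted score, the weight 0.1 survives because it multiplies a large-magnitude input feature, while -2.0 is dropped because its input is consistently small.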