Neural networks can be efficiently compressed through pruning, significantly reducing storage and computational demands while maintaining predictive performance. Simple yet effective methods like Iterative Magnitude Pruning (IMP, Han et al., 2015) remove less important parameters but require a costly retraining procedure to recover performance after pruning. With the rise of Large Language Models (LLMs), however, full retraining has become infeasible due to memory and compute constraints. In this study, we challenge the practice of retraining all parameters by demonstrating that updating only a small subset of highly expressive parameters is often sufficient to recover or even improve performance compared to full retraining. Surprisingly, retraining as little as 0.27% to 0.35% of the parameters of GPT architectures achieves performance comparable to One Shot IMP across various sparsity levels. Our approach, Parameter-Efficient Retraining after Pruning (PERP), drastically reduces compute and memory demands, enabling pruning and retraining of models with up to 30 billion parameters on a single NVIDIA A100 GPU within minutes. Although magnitude pruning is considered unsuited for pruning LLMs, our findings show that PERP positions it as a strong contender against state-of-the-art retraining-free approaches such as Wanda (Sun et al., 2023) and SparseGPT (Frantar & Alistarh, 2023), opening up a promising alternative to avoiding retraining.
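The core idea of the abstract, pruning by magnitude and then updating only a small subset of parameters, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the one-layer linear model, the choice of the bias as the only retrained parameter, the synthetic data, and all function names. It is not the PERP implementation and does not reproduce any result from the paper.

```python
import random

def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the `sparsity` fraction of smallest-|w| entries."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute weight value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask

def predict(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def mse(w, b, data):
    return sum((predict(w, b, x) - y) ** 2 for x, y in data) / len(data)

def retrain_bias_only(w, b, data, lr=0.1, steps=100):
    """Gradient descent on the mean squared error, updating only the bias.

    The pruned weights `w` stay frozen, so the sparsity pattern is preserved.
    """
    for _ in range(steps):
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in data) / len(data)
        b -= lr * grad_b
    return b

random.seed(0)
n_features = 20

# Hypothetical "dense" model that generates the targets.
dense_w = [random.gauss(0, 1) for _ in range(n_features)]
dense_b = 0.5

# Inputs with a nonzero mean, so removing weights shifts the output mean.
data = []
for _ in range(200):
    x = [random.gauss(1.0, 0.5) for _ in range(n_features)]
    data.append((x, predict(dense_w, dense_b, x)))

# One-shot magnitude pruning at 50% sparsity.
mask = magnitude_mask(dense_w, sparsity=0.5)
pruned_w = [w * m for w, m in zip(dense_w, mask)]

loss_before = mse(pruned_w, dense_b, data)
new_b = retrain_bias_only(pruned_w, dense_b, data)
loss_after = mse(pruned_w, new_b, data)

# Only the bias was updated: 1 of (n_features + 1) parameters.
trainable_fraction = 1 / (n_features + 1)

print(f"loss after pruning:          {loss_before:.4f}")
print(f"loss after bias-only update: {loss_after:.4f}")
print(f"fraction of parameters updated: {trainable_fraction:.2%}")
```

In this toy setting the bias can only correct the shift in the mean output caused by pruning, so the recovery is partial. The trainable fraction is also far larger than the 0.27% to 0.35% reported above, because the model is tiny.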