The remarkable success of Large Language Models (LLMs) relies heavily on their substantial scale, which poses significant challenges for deployment in terms of latency and memory consumption. Recently, numerous studies have attempted to compress LLMs using one-shot pruning methods. However, these methods often suffer considerable performance degradation on complex language understanding tasks, raising concerns about the feasibility of pruning in LLMs. To address this issue, we propose Adaptive Sparse Trainer (AST), a novel and efficient retraining framework tailored for semi-structured sparse models. AST enables models to learn optimal masks during the weight update process without incurring additional computational overhead. Furthermore, we demonstrate that incorporating knowledge distillation significantly improves retraining efficiency and enhances model performance under a fixed computational budget. Additionally, we integrate a supplementary set of well-initialized parameters to further improve the model's performance. AST achieves state-of-the-art performance with minimal training cost. When applied to the LLaMA2-7B model, AST reduces the perplexity and zero-shot accuracy gaps between dense and 2:4 semi-structured sparse models to 0.6 and 1.16%, respectively, using less than 0.4% of the pretraining tokens and GPU hours. Our work demonstrates the feasibility of deploying semi-structured sparse LLMs and offers a promising alternative for achieving highly compressed models when combined with existing quantization techniques.