The remarkable success of Large Language Models (LLMs) relies heavily on their substantial scale, which poses significant challenges for deployment in terms of latency and memory consumption. Recently, numerous studies have attempted to compress LLMs using one-shot pruning methods. However, these methods often suffer considerable performance degradation on complex language understanding tasks, raising concerns about the feasibility of pruning in LLMs. To address this issue, we propose Adaptive Sparse Trainer (AST), a novel and efficient retraining framework tailored for semi-structured sparse models. AST enables models to learn optimal masks during the weight update process without incurring additional computational overhead. Furthermore, we demonstrate that incorporating knowledge distillation significantly improves retraining efficiency and enhances model performance under a fixed computational budget. Additionally, we integrate a supplementary set of well-initialized parameters to further improve the model's performance. AST achieves state-of-the-art performance with minimal training cost. When applied to the LLaMA2-7B model, AST reduces the perplexity and zero-shot accuracy gaps between dense and 2:4 semi-structured sparse models to 0.6 and 1.16%, respectively, using less than 0.4% of the pretraining tokens and GPU hours. Our work demonstrates the feasibility of deploying semi-structured sparse LLMs and offers a promising alternative for achieving highly compressed models when combined with existing quantization techniques.