Transformer-based Large Language Models (LLMs) have demonstrated remarkable success across various challenging tasks. However, the deployment of LLMs is hindered by their substantial parameter count and memory consumption. Recently, numerous studies have attempted to compress LLMs by pruning them with training-free methods. However, these pruned models often suffer significant performance degradation on complex tasks. To address this issue, we propose a novel training pipeline for semi-structured sparse models, named Adaptive Sparse Trainer (AST). By distilling the knowledge stored in the dense counterpart, we prevent the sparse model from overfitting and ensure a stable training process. Moreover, AST allows the model to adaptively select better lottery tickets (i.e., masks) during training. Additionally, we find that adding extra well-initialized parameters can further enhance model performance with only a small increase in memory footprint. Our method significantly narrows the performance gap between dense and sparse models while maintaining limited computational cost. Furthermore, when combined with existing quantization methods, AST can compress language models by up to 16x relative to dense models at FP32 precision with minimal performance loss. AST outperforms previous state-of-the-art methods, reducing the zero-shot accuracy gap between dense and semi-structured sparse models to 1.12% across multiple zero-shot tasks on Llama2-7B, using less than 0.4% of the pretraining tokens.
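As a point of reference for the semi-structured sparsity and masks discussed above, the sketch below shows how a magnitude-based mask for the common 2:4 pattern (keep the 2 largest-magnitude weights in every group of 4 consecutive weights) can be computed with NumPy. This is a minimal illustration of semi-structured pruning in general, not the adaptive mask-selection procedure of AST itself; the function name and the use of plain magnitude scoring are our own assumptions for the example.

```python
import numpy as np

def two_four_mask(weights: np.ndarray) -> np.ndarray:
    """Return a binary 2:4 semi-structured sparsity mask.

    In every group of 4 consecutive weights along the last axis,
    the 2 entries with the largest magnitude are kept (mask = 1)
    and the other 2 are pruned (mask = 0). The last axis length
    must be divisible by 4. Magnitude scoring is a simplifying
    assumption; AST selects masks adaptively during training.
    """
    flat = weights.reshape(-1, 4)                      # groups of 4
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]     # 2 smallest per group
    mask = np.ones_like(flat)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return mask.reshape(weights.shape)

# Example: each row of 4 keeps exactly 2 nonzero weights after masking.
w = np.array([[0.9, -0.1, 0.05, -0.7],
              [0.2,  0.3, -0.4,  0.1]])
sparse_w = w * two_four_mask(w)
```

Hardware such as NVIDIA's sparse tensor cores accelerates exactly this 2:4 layout, which is why semi-structured sparsity (unlike unstructured pruning) yields real inference speedups.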