Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach for creating accurate, sparse foundational versions of performant LLMs that achieve full accuracy recovery on fine-tuning tasks at up to 70% sparsity. We achieve this for the LLaMA-2 7B model by combining the SparseGPT one-shot pruning method with sparse pretraining on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset. We exhibit training acceleration due to sparsity on Cerebras CS-3 chips that closely matches theoretical scaling. In addition, we establish inference acceleration of up to 3x on CPUs by utilizing Neural Magic's DeepSparse engine and 1.7x on GPUs through Neural Magic's nm-vllm engine. The above gains are realized via sparsity alone, thus enabling further gains through additional use of quantization. Specifically, we show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x. We demonstrate these results across diverse, challenging tasks, including chat, instruction following, code generation, arithmetic reasoning, and summarization, to prove their generality. This work paves the way for rapidly creating smaller and faster LLMs without sacrificing accuracy.