Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach for creating accurate, sparse foundational versions of performant LLMs that achieve full accuracy recovery on fine-tuning tasks at up to 70% sparsity. We achieve this for the LLaMA-2 7B model by combining the SparseGPT one-shot pruning method with sparse pretraining on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset. We exhibit training acceleration due to sparsity on Cerebras CS-3 chips that closely matches theoretical scaling. In addition, we establish inference acceleration of up to 3x on CPUs by utilizing Neural Magic's DeepSparse engine and 1.7x on GPUs through Neural Magic's nm-vllm engine. The above gains are realized via sparsity alone, thus enabling further gains through additional use of quantization. Specifically, we show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x. We demonstrate these results across diverse, challenging tasks, including chat, instruction following, code generation, arithmetic reasoning, and summarization, to prove their generality. This work paves the way for rapidly creating smaller and faster LLMs without sacrificing accuracy.