Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive. In this paper, we investigate whether pruning an existing LLM and then re-training it with a fraction (<3%) of the original training data can be a suitable alternative to repeated, full retraining. To this end, we develop a set of practical and effective compression best practices for LLMs that combine depth, width, attention, and MLP pruning with knowledge distillation-based retraining; we arrive at these best practices through a detailed empirical exploration of pruning strategies for each axis, methods to combine axes, distillation strategies, and search techniques for arriving at optimal compressed architectures. We use this guide to compress the Nemotron-4 family of LLMs by a factor of 2-4x and compare their performance to similarly-sized models on a variety of language modeling tasks. Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch; this results in compute cost savings of 1.8x for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B, and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. We have open-sourced the Minitron model weights on Hugging Face, with corresponding supplementary material, including example code, available on GitHub.