Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive. In this paper, we investigate whether pruning an existing LLM and then re-training it with a fraction (<3%) of the original training data can be a suitable alternative to repeated, full retraining. To this end, we develop a set of practical and effective compression best practices for LLMs that combine depth, width, attention, and MLP pruning with knowledge distillation-based retraining; we arrive at these best practices through a detailed empirical exploration of pruning strategies for each axis, methods to combine axes, distillation strategies, and search techniques for arriving at optimal compressed architectures. We use this guide to compress the Nemotron-4 family of LLMs by a factor of 2-4x and compare their performance to similarly-sized models on a variety of language modeling tasks. Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch; this results in compute cost savings of 1.8x for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B, and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. We have open-sourced the Minitron model weights on Hugging Face, with corresponding supplementary material, including example code, available on GitHub.