We present a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Aligner and evaluated in instruction-tuned variants. This approach produces a compelling 4B model from Llama 3.1 8B and a state-of-the-art Mistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo 12B. We find that, without access to the original training data, it is beneficial to slightly fine-tune the teacher model on the distillation dataset. We open-source our base model weights on Hugging Face under a permissive license.
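As a rough illustration of the distillation objective referenced above, the sketch below shows a generic logit-distillation loss: the student is trained to match the teacher's softened output distribution via forward KL divergence. This is a minimal, assumed setup for exposition (temperature value, epsilon, and the numpy formulation are illustrative), not the exact recipe used for the Minitron models.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Forward KL(teacher || student) over the vocabulary, averaged over
    tokens and scaled by T^2, as is conventional for soft-target losses."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(kl.mean() * temperature ** 2)

# Identical logits give zero loss; diverging logits give a positive loss.
t = np.array([[2.0, 0.5, -1.0]])
print(distillation_loss(t, t))       # 0.0
print(distillation_loss(t, -t) > 0)  # True
```

In practice this loss is computed per token position over the full vocabulary and often combined with (or used in place of) the standard cross-entropy on ground-truth labels.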