Pruning has become a widely adopted technique for reducing the hardware requirements of large language models (LLMs). To recover model performance after pruning, post-training is commonly employed to mitigate the resulting performance degradation. While post-training benefits from larger datasets, once the dataset size is already substantial, adding more training data yields only limited performance gains. To balance post-training cost against model performance, it is therefore necessary to determine the optimal amount of post-training data. Through extensive experiments on the Llama-3 and Qwen-2.5 series models, pruned with various common pruning methods, we uncover the scaling \textbf{Law} for \textbf{P}ost-training after model \textbf{P}runing, referred to as the P$^2$ Law. This law identifies four key factors for predicting the pruned model's post-training loss: the model size before pruning, the number of post-training tokens, the pruning rate, and the model's loss before pruning. Moreover, the P$^2$ Law generalizes to larger dataset sizes, larger model sizes, and higher pruning rates, offering valuable insights for the post-training of pruned LLMs.
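The abstract does not state the P$^2$ Law's functional form, only its four inputs. As a purely hypothetical illustration of how such a law could be fit in practice, the sketch below assumes a generic power-law ansatz $L_{\text{post}} = L_0 + e^{c}\,N^{a}\,D^{b}\,\rho^{g}$ over the four factors (pre-pruning model size $N$, post-training tokens $D$, pruning rate $\rho$, pre-pruning loss $L_0$) and fits it by least squares in log space on synthetic data; the ansatz, variable names, and coefficients are all assumptions for this demo, not the paper's actual formula.

```python
import numpy as np

# Hypothetical ansatz (NOT the paper's P^2 Law):
#   L_post = L0 + exp(c) * N^a * D^b * rho^g
# We generate noiseless synthetic data from known coefficients and recover
# them via linear regression in log space.

rng = np.random.default_rng(0)
c_true, a_true, b_true, g_true = 0.5, -0.1, -0.2, 0.8  # demo-only coefficients

N = rng.uniform(1e9, 8e9, 200)      # model size before pruning (parameters)
D = rng.uniform(1e8, 1e10, 200)     # number of post-training tokens
rho = rng.uniform(0.1, 0.5, 200)    # pruning rate
L0 = rng.uniform(1.8, 2.2, 200)     # loss before pruning

# Synthetic post-training loss under the assumed ansatz.
L_post = L0 + np.exp(c_true) * N**a_true * D**b_true * rho**g_true

# log(L_post - L0) = c + a*log N + b*log D + g*log rho  -> ordinary least squares
X = np.column_stack([np.ones_like(N), np.log(N), np.log(D), np.log(rho)])
y = np.log(L_post - L0)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
c_fit, a_fit, b_fit, g_fit = coef
```

On noiseless data the fit recovers the generating coefficients to numerical precision; with real post-training measurements one would instead fit the paper's actual functional form, typically with a robust nonlinear optimizer.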