Pruning promises a shortcut to strong small language models. In this work, we examine this promise by pruning Llama-3.1-8B at pruning ratios of 0.5--0.8 with six methods spanning depth, width, and sparse granularities, under two controlled token-matched settings. (1) With the same training token budget, pruned initialization consistently outperforms random initialization. This shows that the parent model provides a strong starting point, although the advantage narrows as the training token budget grows and as the pruning ratio rises, nearly vanishing at the highest pruning ratio we study. (2) When training from scratch is instead given the full token budget consumed by the whole pipeline, pruning at finer granularities still retains an advantage, while coarser structured pruning can be matched or surpassed. This suggests that the parent model transfers knowledge that additional training tokens alone cannot fully recover, but only at fine granularity. Taken together, our results yield a clear recommendation: with a large pretrained model in hand and a limited training token budget, pruning is better than training from scratch; when the training budget is not limited, training from scratch can be competitive for coarser pruning, so a large pretrained parent is not always necessary.