Recent work has shown that layer pruning can effectively compress large language models (LLMs) while retaining strong performance on classification benchmarks, often with little or no finetuning. In contrast, generative reasoning tasks, such as GSM8K and HumanEval\textsuperscript{+}, exhibit substantially weaker recovery. We show that, beyond surface-level text degradation, pruning causes a loss of key algorithmic capabilities, including arithmetic computation and balanced parenthesis generation. Under realistic post-training constraints, that is, without access to pretraining-scale data or compute, we evaluate a minimal recovery strategy based on supervised finetuning on self-generated responses. This approach recovers up to 90\% of baseline performance on classification tasks, but recovery on generative reasoning remains fundamentally limited. Notably, even models finetuned on $\sim$400B tokens after pruning fail to recover their original reasoning performance, suggesting that such capabilities are not easily restored. This limitation persists even on simple tasks such as arithmetic, which do not require multi-step generation. Overall, we characterize the practical limits of layer pruning for generative reasoning and provide guidance on when depth reduction is effective under constrained post-training regimes.