Recent works have shown that layer pruning can compress large language models (LLMs) while retaining strong performance on classification benchmarks with little or no finetuning. However, existing pruning techniques often suffer severe degradation on generative reasoning tasks. Through a systematic study across multiple model families, we find that tasks requiring multi-step reasoning are particularly sensitive to depth reduction. Beyond surface-level text degeneration, we observe degradation of critical algorithmic capabilities, including arithmetic computation for mathematical reasoning and balanced parenthesis generation for code synthesis. Under realistic post-training constraints, without access to pretraining-scale data or compute, we evaluate a simple mitigation strategy based on supervised finetuning with Self-Generated Responses. This approach achieves strong recovery on classification tasks, retaining up to 90\% of baseline performance, and yields substantial gains of up to 20--30 percentage points on generative benchmarks compared to prior post-pruning techniques. Crucially, despite these gains, recovery for generative reasoning remains fundamentally limited relative to classification tasks and is viable primarily at lower pruning ratios. Overall, we characterize the practical limits of layer pruning for generative reasoning and provide guidance on when depth reduction can be applied effectively under constrained post-training regimes.