We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until a large fraction (up to half) of the layers is removed. To prune these models, we identify the optimal block of layers to prune by considering the similarity across layers; then, to "heal" the damage, we perform a small amount of finetuning. In particular, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single A100 GPU. From a practical perspective, these results suggest that layer-pruning methods can complement other PEFT strategies to further reduce the computational resources needed for finetuning, on the one hand, and can reduce the memory footprint and latency of inference, on the other. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.
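The block-selection step described above can be illustrated with a minimal sketch. The abstract does not say which similarity measure is used, so the sketch assumes the angular distance between the hidden states entering a block and those leaving it. It also assumes those hidden states have already been collected into a single array. The function names (`angular_distance`, `best_block_to_prune`, `prune_layers`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def angular_distance(a, b):
    """Mean angular distance, scaled to [0, 1], between matching rows of a and b."""
    cos = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    )
    # Clip to guard against floating-point values slightly outside [-1, 1].
    return float(np.mean(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi))


def best_block_to_prune(hidden_states, n):
    """Pick the start index of the n-layer block whose removal changes the
    representation least.

    hidden_states: array of shape (num_layers + 1, num_samples, dim), where
    hidden_states[l] is the input to layer l and the last entry is the output
    of the final layer (an assumed layout, not taken from the source).
    """
    num_layers = hidden_states.shape[0] - 1
    if not 0 < n <= num_layers:
        raise ValueError("n must satisfy 0 < n <= num_layers")
    # Compare the input of layer l with the input of layer l + n, i.e. the
    # representations just before and just after the candidate block.
    distances = [
        angular_distance(hidden_states[l], hidden_states[l + n])
        for l in range(num_layers - n + 1)
    ]
    start = int(np.argmin(distances))
    return start, distances[start]


def prune_layers(layers, start, n):
    """Return a new list with layers start, ..., start + n - 1 removed."""
    return layers[:start] + layers[start + n:]
```

In this sketch the pruned model would simply feed the output of layer `start - 1` into the old layer `start + n`; the "healing" finetuning step is not shown.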