Large language models (LLMs) enable unparalleled few- and zero-shot reasoning capabilities, but at a high computational footprint. A growing assortment of compression methods promises to reduce the computational burden of LLMs in deployment, but so far only quantization approaches have been demonstrated to compress LLMs effectively while maintaining zero-shot performance. A critical step in the compression process, the pretrain-then-finetune paradigm, has largely been overlooked when adapting existing pruning strategies to LLMs or proposing new ones. In this work, we show that embarrassingly simple layer pruning, coupled with extended language model pretraining as the finetuning phase, produces state-of-the-art results against structured and even semi-structured compression of models at the 7B scale while being more inference-efficient. We call this method LayerChop: we deterministically remove layers from a model and then perform task-agnostic finetuning of the remaining weights via continued self-supervised pretraining. At this scale, we also show how distillation, which has been highly effective for task-agnostic compression of smaller BERT-style models, becomes inefficient compared to our simple pruning technique.
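To make the deterministic layer-removal step concrete, the following is a minimal sketch, not the paper's actual implementation: it assumes a decoder stack of `num_layers` blocks and one common deterministic policy of dropping the top `num_to_drop` blocks before continued pretraining. The function name and the specific drop policy are illustrative assumptions.

```python
def prune_layer_indices(num_layers: int, num_to_drop: int) -> list[int]:
    """Return the indices of decoder blocks to keep after deterministic pruning.

    Illustrative policy (an assumption, not necessarily the paper's choice):
    keep the first (num_layers - num_to_drop) blocks and drop the top ones.
    The surviving blocks would then be finetuned by continued self-supervised
    pretraining, per the LayerChop recipe described above.
    """
    if not 0 <= num_to_drop < num_layers:
        raise ValueError("num_to_drop must be in [0, num_layers)")
    return list(range(num_layers - num_to_drop))


# Example: pruning a hypothetical 32-block 7B-scale model by 25%.
kept = prune_layer_indices(num_layers=32, num_to_drop=8)
print(len(kept))  # 24 blocks remain
```

In a framework like PyTorch, the kept indices would be used to rebuild the model's block list (e.g., subsetting an `nn.ModuleList`) before the continued-pretraining phase.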