The rapid improvement in the performance of large language models (LLMs) has been accompanied by escalating model size, making model training and inference increasingly costly. Prior research has found that certain layers in LLMs are redundant: removing them incurs only a marginal loss in model performance. In this paper, we adopt the probing technique to explain layer redundancy in LLMs and demonstrate that language models can be effectively pruned with probing classifiers. We propose chip-tuning, a simple and effective structured pruning framework specialized for classification problems. Chip-tuning attaches tiny probing classifiers, named chips, to different layers of an LLM and trains the chips while keeping the backbone model frozen. After a chip is selected for classification, all layers after the layer it is attached to can be removed with only marginal performance loss. Experimental results on a variety of LLMs and datasets show that chip-tuning significantly outperforms previous state-of-the-art baselines in both accuracy and pruning ratio, achieving pruning ratios of up to 50%. We further find that chip-tuning can be applied to multimodal models and combined with model finetuning, demonstrating its excellent compatibility.
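The pruning idea above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: random per-class features stand in for the frozen backbone's per-layer hidden states (with most of the class signal emerging by an early layer, mimicking redundancy), and each "chip" is a tiny linear softmax probe trained while the backbone stays untouched. The shallowest chip whose accuracy is close to the best one is then selected, and all deeper layers could be dropped.

```python
# Hypothetical sketch of chip-tuning: train one tiny linear "chip" (probe)
# per layer of a frozen backbone, then pick the shallowest near-best chip.
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, HIDDEN, NUM_CLASSES, N = 8, 16, 3, 300

labels = rng.integers(0, NUM_CLASSES, size=N)
class_means = rng.normal(size=(NUM_CLASSES, HIDDEN))

def hidden_states(layer):
    """Stand-in for the frozen LLM's hidden states at a given layer.

    Class signal grows until layer 4 and then plateaus, mimicking the
    layer redundancy that makes deeper layers removable.
    """
    signal = min(layer + 1, 4) / 4.0
    return signal * class_means[labels] + rng.normal(size=(N, HIDDEN))

def train_chip(X, y, epochs=200, lr=0.5):
    """Train a linear softmax probe; only the chip's own weights are updated."""
    W = np.zeros((HIDDEN, NUM_CLASSES))
    b = np.zeros(NUM_CLASSES)
    Y = np.eye(NUM_CLASSES)[y]  # one-hot targets
    for _ in range(epochs):
        logits = X @ W + b
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - Y) / len(X)  # softmax cross-entropy gradient
        W -= lr * X.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def accuracy(W, b, X, y):
    return float(((X @ W + b).argmax(axis=1) == y).mean())

# Attach one chip per layer and record its accuracy.
accs = []
for layer in range(NUM_LAYERS):
    X = hidden_states(layer)
    W, b = train_chip(X, labels)
    accs.append(accuracy(W, b, X, labels))

# Select the shallowest chip within a small tolerance of the best chip;
# every layer after it could be pruned away.
best = max(accs)
chosen = next(l for l, a in enumerate(accs) if a >= best - 0.02)
print(f"chosen layer {chosen}; layers {chosen + 1}..{NUM_LAYERS - 1} can be dropped")
```

With these toy settings the selected chip sits well before the last layer, so a large fraction of the (simulated) depth is removable, which is the effect the abstract reports on real LLMs.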