The large scale of pre-trained language models makes them difficult to deploy on many devices, which has driven growing interest in model compression methods, particularly knowledge distillation. However, current knowledge distillation methods rely on the model's intermediate-layer features and on golden labels (also called hard labels), which usually require aligned model architectures and sufficient labeled data, respectively. Moreover, existing methods usually neglect the vocabulary parameters. To address these problems, we propose a general language model distillation (GLMD) method that performs two-stage word prediction distillation and vocabulary compression; it is simple yet shows surprisingly strong performance. Specifically, because GLMD uses neither intermediate layers nor golden labels, it removes the constraints on dimension and structure between models and the need for labeled datasets, and so supports more general application scenarios. Meanwhile, exploiting the long-tailed distribution of word frequencies in the data, GLMD compresses the vocabulary by reducing its size rather than its dimensionality. Experimental results show that our method outperforms 25 state-of-the-art methods on the SuperGLUE benchmark, achieving an average score that surpasses the best of them by 3%.
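The two ingredients named above can be illustrated with a minimal sketch. It is not the GLMD implementation: the abstract does not specify the two distillation stages or the exact compression procedure, so the functions `soft_label_loss`, `compress_vocab`, and `encode`, the temperature parameter, and the `<unk>` fallback token are all hypothetical choices made for illustration. The loss compares only the teacher's and student's word prediction distributions, so it needs neither hard labels nor matching hidden sizes; the vocabulary step keeps the most frequent tokens and leaves the embedding dimension untouched.

```python
import math
from collections import Counter


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over the vocabulary.

    Assumption: a generic soft-label objective. Only output logits are
    compared, so no hard labels or intermediate features are required.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def compress_vocab(token_counts, keep, unk="<unk>"):
    """Keep the `keep` most frequent tokens; everything else maps to `unk`.

    Assumption: plain frequency truncation, motivated by the long-tailed
    distribution of word frequencies.
    """
    kept = [tok for tok, _ in token_counts.most_common(keep)]
    return {tok: idx for idx, tok in enumerate([unk] + kept)}


def encode(tokens, vocab, unk="<unk>"):
    """Map tokens to ids in the compressed vocabulary."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]


# Toy usage: identical predictions give zero loss, differing ones do not.
same = soft_label_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
diff = soft_label_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])

corpus = "the cat sat on the mat the cat ran".split()
vocab = compress_vocab(Counter(corpus), keep=2)
ids = encode(corpus, vocab)
```

In this toy corpus only `the` and `cat` survive compression, and the four rarer words share the single `<unk>` id. The embedding table therefore shrinks in its number of rows while each row keeps its full width.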