Knowledge distillation (KD) is a promising method for reducing the computational burden of pre-trained language models (PLMs). Among KD approaches, intermediate layer distillation (ILD) has become a de facto standard in NLP because of its strong performance. In this paper, we find that existing ILD methods are prone to overfitting the training dataset, even though they transfer more information than the original KD. We then present two simple observations that mitigate this overfitting: distilling only the last Transformer layer and conducting ILD on supplementary tasks. Based on these two findings, we propose a simple yet effective consistency-regularized ILD (CR-ILD), which prevents the student model from overfitting the training dataset. Extensive experiments on distilling BERT on the GLUE benchmark and several synthetic datasets demonstrate that our proposed ILD method outperforms other KD techniques. Our code is available at https://github.com/jongwooko/CR-ILD.