Recently, various intermediate layer distillation (ILD) objectives have been shown to improve the compression of BERT models via Knowledge Distillation (KD). However, a comprehensive evaluation of these objectives in both task-specific and task-agnostic settings is lacking. To the best of our knowledge, this is the first work to evaluate distillation objectives comprehensively in both settings. We show that attention transfer gives the best performance overall. We also study the impact of layer choice when initializing the student from the teacher's layers and find that it significantly affects performance in task-specific distillation. For vanilla KD and hidden states transfer, initialization with lower layers of the teacher gives a considerable improvement over higher layers, especially on QNLI (an absolute accuracy difference of up to 17.8 percentage points). Attention transfer behaves consistently under different initialization settings. We release our code as an efficient transformer-based model distillation framework for further studies.
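The contrast between initializing the student from lower and from higher teacher layers can be made concrete with a minimal sketch. It is an illustration only, not the released framework: the function name `select_teacher_layers`, the strategy labels, and the 12-layer teacher with a 6-layer student are all hypothetical, and it assumes that "lower" means the first contiguous block of teacher layers and "higher" the last one.

```python
from typing import List


def select_teacher_layers(num_teacher_layers: int,
                          num_student_layers: int,
                          strategy: str) -> List[int]:
    """Return 0-based indices of the teacher layers copied into the student.

    Hypothetical helper: "lower" takes the first contiguous block of
    teacher layers, "higher" takes the last contiguous block.
    """
    if not 0 < num_student_layers <= num_teacher_layers:
        raise ValueError("student must have between 1 and num_teacher_layers layers")
    if strategy == "lower":
        return list(range(num_student_layers))
    if strategy == "higher":
        start = num_teacher_layers - num_student_layers
        return list(range(start, num_teacher_layers))
    raise ValueError(f"unknown strategy: {strategy!r}")


# Assumed sizes for illustration: a 12-layer teacher and a 6-layer student.
lower = select_teacher_layers(12, 6, "lower")
higher = select_teacher_layers(12, 6, "higher")
print(lower)
print(higher)
```

Under these assumptions, student layer `i` would be initialized from the teacher layer at position `i` of the returned list, so only the choice of strategy changes which teacher weights the student starts from.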