We present a large-scale empirical study of how the choice of configuration parameters affects performance in knowledge distillation (KD). One such parameter is the measure of distance between the predictions of the teacher and the student, for which common choices include the mean squared error (MSE) and the KL-divergence. Although scattered efforts have been made to understand the differences between such options, the KD literature still lacks a systematic study of their general effect on student performance. We take an empirical approach to this question, measuring the extent to which such choices influence student performance across 13 datasets from 4 NLP tasks and 3 student sizes. We quantify the cost of making sub-optimal choices and identify a single configuration that performs well across the board.
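To make the distance-measure parameter concrete, the following is a minimal sketch of the two options named above, written in plain Python. It is an illustration only and is not taken from the study: the function names, the `temperature` argument, and the decision to compute MSE on raw logits (rather than on probabilities) are all assumptions.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits.

    `temperature` is an assumed, illustrative parameter; the source does not
    specify how (or whether) predictions are temperature-scaled.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum_exp for z in scaled]


def mse_distance(teacher_logits, student_logits):
    """Mean squared error between teacher and student logits (assumed variant)."""
    n = len(teacher_logits)
    return sum((t - s) ** 2 for t, s in zip(teacher_logits, student_logits)) / n


def kl_distance(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between the two softmax distributions."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher log-probabilities
    log_q = log_softmax(student_logits, temperature)  # student log-probabilities
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]
    shifted_student = [2.0, 3.0, 4.0]  # teacher logits plus a constant offset
    print("MSE:", mse_distance(teacher, shifted_student))
    print("KL :", kl_distance(teacher, shifted_student))
```

One difference this sketch exposes: adding a constant to every student logit leaves the softmax distribution, and therefore the KL-divergence, unchanged, whereas the MSE on logits penalizes the offset. The two measures can thus give the student different training signals for the same teacher output.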