Knowledge distillation has been widely used to improve the performance of a "student" network by training it to mimic the soft probabilities of a "teacher" network. Yet, for self-distillation to work, the student must somehow deviate from the teacher (Stanton et al., 2021). But what is the nature of these deviations, and how do they relate to gains in generalization? We investigate these questions through a series of experiments across image and language classification datasets. First, we observe that the distilled student consistently deviates from the teacher in a characteristic way: on points where the teacher has low confidence, the student achieves even lower confidence than the teacher. Second, we find that deviations in the initial dynamics of training are not crucial: simply switching to the distillation loss in the middle of training can recover much of its gains. We then provide two parallel theoretical perspectives to understand the role of student-teacher deviations in our experiments, one casting distillation as a regularizer in eigenspace and the other as a gradient denoiser. Our analysis bridges several gaps between existing theory and practice by (a) focusing on gradient-descent training, (b) avoiding label noise assumptions, and (c) unifying several disjoint empirical and theoretical findings.
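As a rough illustration of the two ingredients mentioned above (matching the teacher's soft probabilities, and switching to the distillation loss partway through training), the following is a minimal sketch rather than the implementation used in the experiments. The function names, the `switch_step` parameter, and the `temperature` parameter are hypothetical and chosen only for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, student_logits, temperature=1.0):
    """Cross-entropy between a target distribution and the student's softmax."""
    student_probs = softmax(student_logits, temperature)
    return -sum(
        t * math.log(p) for t, p in zip(target_probs, student_probs) if t > 0
    )


def training_loss(student_logits, teacher_logits, label, step, switch_step,
                  temperature=1.0):
    """Hypothetical schedule: one-hot loss before `switch_step`,
    distillation loss (teacher soft probabilities as targets) afterwards."""
    if step < switch_step:
        # Standard training: the target is the one-hot ground-truth label.
        one_hot = [1.0 if k == label else 0.0
                   for k in range(len(student_logits))]
        return cross_entropy(one_hot, student_logits)
    # Distillation: the target is the teacher's soft probability vector.
    teacher_probs = softmax(teacher_logits, temperature)
    return cross_entropy(teacher_probs, student_logits, temperature)
```

Under this sketch, the distillation term is smallest when the student's probabilities equal the teacher's, so any systematic student-teacher deviation has to come from the training dynamics rather than from the loss itself.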