Revisiting Knowledge Distillation via Label Smoothing Regularization

Knowledge Distillation (KD) aims to distill the knowledge of a cumbersome teacher model into a lightweight student model. Its success is generally attributed to the privileged information on similarities among categories provided by the teacher model, and in this sense, only strong teacher models are deployed to teach weaker students in practice. In this work, we challenge this common belief with the following experimental observations: 1) beyond the acknowledgment that the teacher can improve the student, the student can also enhance the teacher significantly by reversing the KD procedure; 2) a poorly trained teacher with much lower accuracy than the student can still improve the latter significantly. To explain these observations, we provide a theoretical analysis of the relationship between KD and label smoothing regularization. We prove that 1) KD is a type of learned label smoothing regularization and 2) label smoothing regularization provides a virtual teacher model for KD. From these results, we argue that the success of KD is not fully due to the similarity information between categories from teachers, but also to the regularization of soft targets, which is equally or even more important. Based on these analyses, we further propose a novel Teacher-free Knowledge Distillation (Tf-KD) framework, where a student model learns from itself or from a manually designed regularization distribution. Tf-KD achieves performance comparable to normal KD from a superior teacher, which makes it well suited to settings where a stronger teacher model is unavailable. Meanwhile, Tf-KD is generic and can be directly deployed for training deep neural networks. Without any extra computation cost, Tf-KD achieves up to 0.65% improvement on ImageNet over well-established baseline models, which is superior to label smoothing regularization.
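
The claimed link between KD and label smoothing can be illustrated with a minimal sketch. It rests on simplifying assumptions that the abstract does not state: temperature fixed to 1, and cross-entropy against the teacher used in place of the KL divergence (the two differ only by the teacher's entropy, which does not depend on the student). Under these assumptions, the label smoothing loss coincides with a soft-target loss whose "teacher" is the uniform distribution. All function names below are hypothetical, and `virtual_teacher`, with its `correct_prob=0.9` default, is only one possible manually designed distribution, not necessarily the one used by the authors.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred):
    """H(target, pred) = -sum_k target[k] * log(pred[k])."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)

def one_hot(label, num_classes):
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def label_smoothing_loss(logits, label, alpha):
    """Cross-entropy against the target (1 - alpha) * one_hot + alpha * uniform."""
    k = len(logits)
    smoothed = [(1 - alpha) * q + alpha / k for q in one_hot(label, k)]
    return cross_entropy(smoothed, softmax(logits))

def soft_target_loss(logits, label, teacher_probs, alpha):
    """(1 - alpha) * H(one_hot, p) + alpha * H(teacher, p), with temperature 1."""
    p = softmax(logits)
    hard = cross_entropy(one_hot(label, len(logits)), p)
    soft = cross_entropy(teacher_probs, p)
    return (1 - alpha) * hard + alpha * soft

def virtual_teacher(label, num_classes, correct_prob=0.9):
    """Hypothetical hand-designed teacher: correct_prob on the true class,
    the remaining mass spread evenly over the other classes."""
    rest = (1 - correct_prob) / (num_classes - 1)
    return [correct_prob if k == label else rest for k in range(num_classes)]

# Usage: a uniform "teacher" reproduces label smoothing, while a
# hand-designed teacher can be plugged into the same loss without a trained model.
logits = [2.0, 0.5, -1.0]
uniform = [1.0 / 3] * 3
ls_loss = label_smoothing_loss(logits, label=0, alpha=0.1)
kd_uniform = soft_target_loss(logits, label=0, teacher_probs=uniform, alpha=0.1)
kd_virtual = soft_target_loss(logits, 0, virtual_teacher(0, 3), alpha=0.1)
```

Because cross-entropy is linear in its target, `ls_loss` and `kd_uniform` agree up to floating-point error. Replacing the uniform distribution with a trained teacher's output recovers the usual soft-target term, which is the sense in which KD can be read as a learned form of label smoothing.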
