Understanding the Effects of Projectors in Knowledge Distillation

In knowledge distillation (e.g., feature distillation), an additional projector is conventionally needed to transform features because the teacher and student networks have mismatched feature dimensions. Interestingly, we find that even when the student and the teacher have the same feature dimensions, adding a projector still improves distillation performance. Projectors also improve logit distillation when added to the architecture. Motivated by these surprising findings and by the general lack of understanding of projectors in the existing knowledge distillation literature, this paper investigates the implicit role that projectors play, which has so far been overlooked. Our empirical study shows that a student with a projector (1) achieves a better trade-off between training accuracy and test accuracy than a student without a projector when it has the same feature dimensions as the teacher, (2) better preserves its similarity to the teacher beyond shallow, numeric resemblance, as measured by Centered Kernel Alignment (CKA), and (3) avoids being over-confident at test time, as the teacher is. Building on these positive effects, we propose a projector ensemble-based feature distillation method to further improve distillation performance. Despite the simplicity of the proposed strategy, empirical results on classification tasks over benchmark datasets demonstrate the superior classification performance of our method across a broad range of teacher-student pairs, and analyses of CKA and model calibration verify that the projector ensemble design improves the quality of the student's features.
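
To make the two central objects of the abstract concrete, the following is a minimal NumPy sketch, not the paper's implementation. The abstract does not specify the projector architecture, how the ensemble's outputs are combined, the distillation loss, or which CKA kernel is used. The sketch therefore assumes linear projectors whose outputs are averaged, a mean-squared-error feature loss, and the linear-kernel form of CKA. The names `linear_cka` and `ensemble_project` are hypothetical.

```python
import numpy as np


def linear_cka(x, y):
    """Linear-kernel CKA between feature matrices of shape (n_samples, dim)."""
    # Center each feature column so the similarity ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


def ensemble_project(student_feats, weights, biases):
    """Average the outputs of several linear projectors.

    Averaging is an assumption made for illustration; the abstract does
    not say how the ensemble members are combined.
    """
    outputs = [student_feats @ w + b for w, b in zip(weights, biases)]
    return np.mean(outputs, axis=0)


# Toy data standing in for a batch of student and teacher features.
rng = np.random.default_rng(0)
n, d_student, d_teacher, num_projectors = 64, 16, 32, 3
student = rng.normal(size=(n, d_student))
teacher = rng.normal(size=(n, d_teacher))

# Differently initialized projectors mapping student dim -> teacher dim.
weights = [rng.normal(scale=0.1, size=(d_student, d_teacher))
           for _ in range(num_projectors)]
biases = [np.zeros(d_teacher) for _ in range(num_projectors)]

projected = ensemble_project(student, weights, biases)

# Assumed feature-distillation loss: MSE between projected student
# features and teacher features.
mse = float(np.mean((projected - teacher) ** 2))

# CKA between projected student features and teacher features.
cka = float(linear_cka(projected, teacher))
```

Linear CKA is invariant to isotropic scaling and orthogonal transformations of the features, so it compares representations beyond element-wise numeric closeness. In a training loop the projector weights would be learned jointly with the student, which this toy example omits.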
