In knowledge distillation, previous feature distillation methods mainly focus on the design of loss functions and the selection of the distilled layers, while the effect of the feature projector between the student and the teacher remains under-explored. In this paper, we first discuss a plausible mechanism of the projector with empirical evidence and then propose a new feature distillation method based on a projector ensemble for further performance improvement. We observe that the student network benefits from a projector even if the feature dimensions of the student and the teacher are the same. Training a student backbone without a projector can be considered a multi-task learning process, in which the backbone must simultaneously extract discriminative features for classification and match the teacher's features for distillation. We hypothesize and empirically verify that without a projector, the student network tends to overfit the teacher's feature distributions despite having a different architecture and weight initialization. This degrades the quality of the student's deep features, which are eventually used in classification. Adding a projector, on the other hand, disentangles the two learning tasks and helps the student network focus on the main feature extraction task while still using teacher features as guidance through the projector. Motivated by the positive effect of the projector in feature distillation, we propose an ensemble of projectors to further improve the quality of student features. Experimental results on different datasets with a series of teacher-student pairs demonstrate the effectiveness of the proposed method.
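To make the idea of a projector ensemble concrete, the following is a minimal, dependency-free sketch rather than the authors' implementation. The abstract does not specify how the projectors are built or combined, so several choices here are assumptions made for illustration: each projector is a single linear map followed by ReLU, the ensemble output is the plain average of the individual projector outputs, and the distillation loss is a cosine distance between the projected student feature and the teacher feature. All function names (`make_projector`, `ensemble_project`, `feature_distillation_loss`) are hypothetical.

```python
import math
import random


def make_projector(d_in, d_out, rng):
    """Return a randomly initialised d_out x d_in weight matrix (one projector)."""
    scale = 1.0 / math.sqrt(d_in)
    return [[rng.uniform(-scale, scale) for _ in range(d_in)] for _ in range(d_out)]


def project(weights, feature):
    """Apply one linear projector followed by ReLU (the ReLU is an assumption)."""
    return [max(0.0, sum(w * x for w, x in zip(row, feature))) for row in weights]


def ensemble_project(projectors, feature):
    """Average the outputs of all projectors (averaging is an assumption)."""
    outputs = [project(w, feature) for w in projectors]
    return [sum(vals) / len(outputs) for vals in zip(*outputs)]


def cosine_distance(a, b, eps=1e-12):
    """Return 1 - cos(a, b); eps guards against division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + eps)


def feature_distillation_loss(projectors, student_feature, teacher_feature):
    """Compare the projected student feature with the teacher feature.

    The student backbone's raw feature is never matched to the teacher directly;
    only its projection is, which is the decoupling the abstract describes.
    """
    projected = ensemble_project(projectors, student_feature)
    return cosine_distance(projected, teacher_feature)


# Usage with made-up numbers: three projectors, 4-dimensional features.
rng = random.Random(0)
projectors = [make_projector(4, 4, rng) for _ in range(3)]
student = [0.5, -1.0, 2.0, 0.1]
teacher = [1.0, 0.0, 0.5, 2.0]
loss = feature_distillation_loss(projectors, student, teacher)
```

One design point worth noting under these assumptions: if the projectors were purely linear, averaging their outputs would be equivalent to a single linear projector whose weights are the average of the ensemble's weights, so a nonlinearity such as the ReLU here is what lets differently initialised projectors contribute distinct transformations. In training, a term like this would presumably be added to the classification loss with a weighting factor, which is also an assumption not stated in the abstract.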