Improving Knowledge Distillation via Regularizing Feature Norm and Direction

Knowledge distillation (KD) exploits a large, well-trained model (i.e., the teacher) to train a small student model on the same dataset for the same task. Treating teacher features as knowledge, prevailing KD methods train the student by aligning its features with the teacher's, e.g., by minimizing the KL-divergence between their logits or the L2 distance between their intermediate features. While it is natural to believe that better alignment of student features with the teacher's better distills teacher knowledge, simply forcing this alignment does not directly contribute to the student's performance, e.g., classification accuracy. In this work, we propose to align student features with the class-means of teacher features, where the class-means naturally serve as a strong classifier. To this end, we explore baseline techniques such as adopting a cosine-distance-based loss to encourage similarity between student features and their corresponding teacher class-means. Moreover, we train the student to produce large-norm features, inspired by other lines of work (e.g., model pruning and domain adaptation) which find large-norm features to be more significant. Finally, we propose a simple loss term (dubbed ND loss) to simultaneously (1) encourage the student to produce large-\emph{norm} features, and (2) align the \emph{direction} of student features with that of the teacher class-means. Experiments on standard benchmarks demonstrate that our explored techniques help existing KD methods achieve better performance, i.e., higher classification accuracy on the ImageNet and CIFAR100 datasets, and higher detection precision on the COCO dataset. Importantly, our proposed ND loss helps the most, leading to state-of-the-art performance on these benchmarks. The source code is available at \url{https://github.com/WangYZ1608/Knowledge-Distillation-via-ND}.
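To make the two ingredients concrete, the following is a minimal sketch in plain Python, not the ND loss as defined in the paper or its released code. It shows teacher class-means, a cosine-distance loss that depends on direction only, and a hypothetical projection-style term that rewards both a large norm and an aligned direction. The function names (`class_means`, `cosine_loss`, `projection_loss`, `predict`) and the exact form of `projection_loss` are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def class_means(features, labels):
    """Average the teacher feature vectors belonging to each class."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def cosine_loss(student_feat, mean, eps=1e-12):
    """1 - cosine similarity: sensitive to direction, not to norm."""
    return 1.0 - dot(student_feat, mean) / (norm(student_feat) * norm(mean) + eps)

def projection_loss(student_feat, mean, eps=1e-12):
    """Hypothetical stand-in for a norm-and-direction term (an assumption,
    not the paper's formula): the negative length of the student feature's
    projection onto the unit class-mean direction. It decreases when the
    feature is better aligned and when its norm grows."""
    return -dot(student_feat, mean) / (norm(mean) + eps)

def predict(feat, means):
    """Use the class-means as a classifier: pick the most similar mean."""
    return min(means, key=lambda y: cosine_loss(feat, means[y]))

# Toy teacher features: two samples of class 0, one sample of class 1.
teacher_feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
teacher_labels = [0, 0, 1]
means = class_means(teacher_feats, teacher_labels)

small = [1.0, 0.0]  # aligned with class 0, small norm
large = [3.0, 0.0]  # same direction, larger norm
```

In this toy setup, `cosine_loss` cannot distinguish `small` from `large` because they point the same way, whereas `projection_loss` is lower for `large`. That difference is the intuition behind combining norm and direction in a single term.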
