Existing knowledge distillation (KD) methods have demonstrated their ability to bring student network performance on par with that of their teachers. However, the knowledge gap between teacher and student remains significant and may hinder the effectiveness of the distillation process. In this work, we introduce the structure of Neural Collapse (NC) into the KD framework. NC typically emerges in the final phase of training, yielding a graceful geometric structure in which the last-layer features form a simplex equiangular tight frame (ETF). This phenomenon has been shown to improve the generalization of deep network training. We hypothesize that NC can also alleviate the knowledge gap in distillation, thereby enhancing student performance. This paper begins with an empirical analysis that bridges knowledge distillation and neural collapse. Through this analysis, we establish that transferring the teacher's NC structure to the student benefits the distillation process. Therefore, instead of merely transferring instance-level logits or features, as existing distillation methods do, we encourage students to learn the teacher's NC structure. To this end, we propose a new distillation paradigm termed Neural Collapse-inspired Knowledge Distillation (NCKD). Comprehensive experiments demonstrate that NCKD is simple yet effective, improving the generalization of all distilled student models and achieving state-of-the-art accuracy.
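The simplex equiangular tight frame mentioned above has a standard closed form: for K classes, the K class-mean directions are unit-norm vectors whose pairwise cosine similarity is exactly -1/(K-1), the most "spread out" configuration possible. A minimal sketch verifying this geometry numerically (the class count K here is purely illustrative, not a value from the paper):

```python
import numpy as np

K = 4  # illustrative number of classes
# Standard simplex ETF construction: scale the centered identity so that
# each column has unit norm and pairwise inner product -1/(K-1).
M = np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)

# Gram matrix of the K vectors: diagonal = 1, off-diagonal = -1/(K-1)
G = M.T @ M
print(np.round(G, 4))
```

Under neural collapse, the within-class feature means (after centering by the global mean) converge to such a configuration, which is what NCKD encourages the student to inherit from the teacher.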