Knowledge distillation (KD) involves transferring knowledge from one neural network to another, typically from a larger, well-trained model (the teacher) to a smaller, more efficient model (the student). Traditional KD methods minimize the Kullback-Leibler (KL) divergence between the probabilistic outputs of the teacher and student networks. However, this approach often overlooks crucial structural knowledge embedded within the teacher's network. In this paper, we introduce Invariant Consistency Distillation (ICD), a novel methodology designed to enhance KD by ensuring that the student model's representations are consistent with those of the teacher. Our approach combines contrastive learning with an explicit invariance penalty, capturing significantly more of the information in the teacher's representation of the data. Our results on CIFAR-100 demonstrate that ICD outperforms traditional KD and surpasses 13 state-of-the-art methods. In some cases, the student even exceeds the teacher's accuracy. Furthermore, we successfully transfer our method to other datasets, including Tiny ImageNet and STL-10. The code will be made public soon.
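To make the baseline concrete, the sketch below implements the traditional KD objective the abstract refers to: the temperature-scaled KL divergence between teacher and student output distributions (following Hinton et al.'s formulation). The second function is a purely hypothetical illustration of an "invariance penalty" as a distance between L2-normalized feature vectors; the paper's actual ICD penalty and contrastive term are not specified in this abstract, so this is an assumption, not the authors' method.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_kl_loss(teacher_logits, student_logits, temperature=4.0):
    """Classic KD loss: KL(p_teacher || p_student) at temperature T,
    scaled by T^2 so gradient magnitudes stay comparable across T."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature * temperature


def invariance_penalty(teacher_feat, student_feat):
    """HYPOTHETICAL sketch of an invariance penalty: squared L2 distance
    between unit-normalized features, so the penalty is invariant to the
    overall scale of each representation. Not the paper's exact term."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]
    a, b = normalize(teacher_feat), normalize(student_feat)
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
```

Identical logits give zero KL loss, and features that differ only by scale incur zero penalty, illustrating the "invariance" idea.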