Knowledge distillation (KD) is an effective method for transferring knowledge from a large, well-trained teacher model to a smaller, more efficient student model. Despite its success, one of the main challenges in KD is ensuring the efficient transfer of complex knowledge while maintaining the student's computational efficiency. Unlike prior work that applies contrastive objectives relying on explicit negative instances, we introduce Relational Representation Distillation (RRD). Our approach leverages pairwise similarities to capture and reinforce the relational structure between the teacher and student models. Inspired by self-supervised learning principles, it uses a relaxed contrastive loss that emphasizes similarity rather than exact replication. This method aligns the student's and teacher's output distributions over a large memory buffer of teacher samples, improving the robustness and performance of the student model without requiring strict differentiation of negative instances. Our approach demonstrates superior performance on CIFAR-100, outperforming traditional KD techniques and surpassing 13 state-of-the-art methods. It also transfers successfully to other datasets such as Tiny ImageNet and STL-10. The code will be made public soon.
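To make the idea concrete, the following is a minimal numpy sketch of a relaxed contrastive objective of the kind described above: the student's and teacher's pairwise similarities to a memory buffer of teacher features are converted into softmax distributions and matched via KL divergence, with no hard negative mining. All names, the temperature values, and the exact formulation are illustrative assumptions, not the paper's definitive implementation.

```python
import numpy as np

def _softmax(x, temperature):
    # Temperature-scaled softmax over the last axis, numerically stabilized.
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def relational_distillation_loss(student_feat, teacher_feat, memory,
                                 tau_s=0.1, tau_t=0.1):
    """Hypothetical sketch of a relational distillation loss.

    student_feat: (B, D) student embeddings for a batch
    teacher_feat: (B, D) teacher embeddings for the same batch
    memory:       (K, D) buffer of stored teacher embeddings
    """
    # L2-normalize so dot products are cosine similarities.
    norm = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    s, t, m = norm(student_feat), norm(teacher_feat), norm(memory)

    # Similarity distributions over the memory buffer.
    p_s = _softmax(s @ m.T, tau_s)  # student's relational distribution
    p_t = _softmax(t @ m.T, tau_t)  # teacher's (target) distribution

    # KL(p_t || p_s): match relational structure, not exact features,
    # and without explicit per-instance negatives.
    eps = 1e-12
    kl = np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=-1)
    return float(kl.mean())
```

With equal temperatures, the loss is zero when student and teacher embeddings coincide, and it penalizes the student only for distorting the teacher's similarity structure over the buffer rather than for failing to replicate features exactly.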