Cosine Similarity Knowledge Distillation for Individual Class Information Transfer

Previous logit-based knowledge distillation (KD) methods have used the predictions over multiple categories within each sample (i.e., class predictions) and have employed Kullback-Leibler (KL) divergence to reduce the discrepancy between the student and teacher predictions. Despite the proliferation of KD techniques, the student model still falls short of the teacher's performance. In response, we introduce a novel and effective KD method capable of achieving results on par with or superior to the teacher model's performance. We use the teacher and student predictions over multiple samples for each category (i.e., batch predictions) and apply cosine similarity, a metric commonly used in Natural Language Processing (NLP) to measure the resemblance between text embeddings. Because cosine similarity is scale-invariant, depending solely on vector direction and not magnitude, the student can learn dynamically from the teacher's knowledge rather than being bound to a fixed distribution of that knowledge. Furthermore, we propose a method called cosine similarity weighted temperature (CSWT) to improve performance. CSWT lowers the temperature used in KD when the cosine similarity between the student and teacher models is high and raises it when the cosine similarity is low. This adjustment optimizes the transfer of information from the teacher to the student model. Extensive experimental results show that our proposed method serves as a viable alternative to existing methods. We anticipate that this approach will offer valuable insights for future research on model compression.
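The two ideas described above (cosine similarity over batch predictions, and a temperature that falls as similarity rises) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the linear temperature schedule, the bounds `t_min` and `t_max`, the per-sample granularity of the temperature, and all function names are assumptions made only for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one sample's class logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between u and v; unaffected by rescaling either one."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def cswt_temperature(similarity, t_min=1.0, t_max=4.0):
    """ASSUMED linear schedule: high similarity -> low temperature.

    The abstract states only the direction of the adjustment; the linear
    form and the bounds t_min / t_max are hypothetical.
    """
    s = min(max(similarity, 0.0), 1.0)  # probability vectors give values in [0, 1]
    return t_max - (t_max - t_min) * s


def batch_cosine_loss(student_logits, teacher_logits, t_min=1.0, t_max=4.0):
    """Mean (1 - cosine similarity) over classes, comparing batch predictions.

    Inputs are lists of per-sample logit lists (batch_size x num_classes).
    """
    student_probs, teacher_probs = [], []
    for s_logits, t_logits in zip(student_logits, teacher_logits):
        # ASSUMPTION: one temperature per sample, chosen from the similarity
        # of the untempered student and teacher predictions.
        sim = cosine_similarity(softmax(s_logits), softmax(t_logits))
        temp = cswt_temperature(sim, t_min, t_max)
        student_probs.append(softmax(s_logits, temp))
        teacher_probs.append(softmax(t_logits, temp))

    # Transpose so each vector holds one class's predictions across the batch.
    student_by_class = list(zip(*student_probs))
    teacher_by_class = list(zip(*teacher_probs))

    losses = [
        1.0 - cosine_similarity(s_col, t_col)
        for s_col, t_col in zip(student_by_class, teacher_by_class)
    ]
    return sum(losses) / len(losses)
```

In this sketch the loss compares, for each class, the vector of student predictions across the batch with the corresponding teacher vector. Because only direction matters, the student is not forced to match the teacher's exact probability values, which is the scale-invariance property the abstract highlights.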
