In this paper we revisit the efficacy of knowledge distillation as a function-matching and metric-learning problem. In doing so we verify three important design decisions, namely normalisation, a soft maximum function, and projection layers, as key ingredients. We show theoretically that the projector implicitly encodes information on past examples, enabling relational gradients for the student. We then show that the normalisation of representations is tightly coupled with the training dynamics of this projector, which can have a large impact on the student's performance. Finally, we show that a simple soft maximum function can be used to address any significant capacity gap problems. Experimental results on various benchmark datasets demonstrate that using these insights can lead to performance superior or comparable to state-of-the-art knowledge distillation techniques, while being much more computationally efficient. In particular, we obtain these results across image classification (CIFAR100 and ImageNet), object detection (COCO2017), and more difficult distillation objectives, such as training data-efficient transformers, where we attain 77.2% top-1 accuracy with DeiT-Ti on ImageNet. Code and models are publicly available.
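To make the three ingredients concrete, the following is a minimal NumPy sketch of how a projector, normalisation, and a soft maximum could be combined into a single feature-distillation loss. It is an illustration under stated assumptions, not the released implementation: the abstract does not give the exact form of any component, so the bias-free linear projector `W`, the batch-wise standardisation in `standardise`, the log-sum-exp form of `soft_maximum`, and the temperature `tau` are all hypothetical choices made here.

```python
import numpy as np


def standardise(z, eps=1e-5):
    # Assumed normalisation: zero mean and unit variance per feature,
    # computed over the batch, with no learned scale or shift.
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)


def soft_maximum(x, tau=10.0):
    # Assumed soft maximum: (1 / tau) * log(sum(exp(tau * x))).
    # Subtracting the true maximum first keeps the exponentials bounded.
    x = np.asarray(x, dtype=np.float64).ravel()
    m = x.max()
    return m + np.log(np.exp(tau * (x - m)).sum()) / tau


def distillation_loss(student_feats, teacher_feats, projector, tau=10.0):
    # Map student features into the teacher's feature dimension.
    projected = student_feats @ projector
    # Normalise both sides so that their scales are comparable.
    zs = standardise(projected)
    zt = standardise(teacher_feats)
    # Smoothly approximate the largest absolute feature difference.
    return soft_maximum(np.abs(zs - zt), tau=tau)


rng = np.random.default_rng(0)
student = rng.normal(size=(8, 16))   # batch of 8, student width 16
teacher = rng.normal(size=(8, 32))   # batch of 8, teacher width 32
W = rng.normal(size=(16, 32)) * 0.1  # hypothetical projector weights
loss = distillation_loss(student, teacher, W)
```

In this sketch only the forward computation is shown. In training, `W` would be learned jointly with the student, which is where the coupling between normalisation and projector dynamics described above would come into play. The log-sum-exp form always upper-bounds the true maximum and exceeds it by at most `log(n) / tau` for `n` inputs, so a larger `tau` tracks the worst-matched feature more closely.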