Feature distillation makes the student mimic the intermediate features of the teacher. Nearly all existing feature-distillation methods use L2 distance, or a slight variant of it, as the distance metric between teacher and student features. However, while L2 distance is isotropic w.r.t. all dimensions, a neural network's operation on different dimensions is usually anisotropic, i.e., perturbations with the same 2-norm but in different dimensions of the intermediate features lead to changes of very different magnitude in the final output. Considering this, we argue that the similarity between teacher and student features should not be measured merely by their appearance (i.e., L2 distance), but should, more importantly, be measured by their difference in function, namely how the later layers of the network read, decode, and process them. Therefore, we propose Function-Consistent Feature Distillation (FCFD), which explicitly optimizes the functional similarity between teacher and student features. The core idea of FCFD is to make teacher and student features not only numerically similar but, more importantly, produce similar outputs when fed to the later part of the same network. With FCFD, the student mimics the teacher more faithfully and learns more from the teacher. Extensive experiments on image classification and object detection demonstrate that FCFD outperforms existing methods. Furthermore, FCFD can be combined with many existing methods to obtain even higher accuracy. Our code is available at https://github.com/LiuDongyang6/FCFD.
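The contrast between appearance and function can be illustrated with a toy example. The sketch below is not the released FCFD implementation. It assumes a hypothetical two-dimensional feature, a hypothetical fixed linear map `later_layers` standing in for the later part of a network, squared error as the output distance, and a hypothetical weighting factor `beta`. Two student features lie at the same L2 distance from the teacher feature, yet they differ greatly in how much they change the output of the later layers.

```python
# Toy illustration only; every name and number here is an assumption,
# not taken from the FCFD code base.

def later_layers(feature):
    """Hypothetical 'later part' of a network: a fixed linear map that is
    sensitive to dimension 0 and much less sensitive to dimension 1."""
    weights = [[4.0, 0.0],
               [0.0, 0.25]]
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def squared_l2(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def appearance_loss(teacher_feat, student_feat):
    """Isotropic feature loss: compares the features directly."""
    return squared_l2(teacher_feat, student_feat)


def function_loss(teacher_feat, student_feat, later=later_layers):
    """Functional loss: compares what the same later layers produce
    when fed the teacher feature versus the student feature."""
    return squared_l2(later(teacher_feat), later(student_feat))


def combined_loss(teacher_feat, student_feat, beta=1.0):
    """Appearance term plus a weighted functional term (beta is hypothetical)."""
    return (appearance_loss(teacher_feat, student_feat)
            + beta * function_loss(teacher_feat, student_feat))


teacher = [1.0, 1.0]
student_a = [1.5, 1.0]  # perturbed by 0.5 along the sensitive dimension
student_b = [1.0, 1.5]  # perturbed by 0.5 along the insensitive dimension

for name, student in (("student_a", student_a), ("student_b", student_b)):
    print(name,
          "appearance:", appearance_loss(teacher, student),
          "function:", function_loss(teacher, student))
```

Both students have the same appearance loss, so an L2-only objective cannot tell them apart, whereas the functional term penalizes `student_a` far more heavily because its deviation lies along the dimension that the later layers amplify.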