Knowledge Distillation (KD) seeks to transfer the knowledge of a teacher network to a student neural network. This is commonly done by matching the networks' predictions (i.e., their outputs), but several recent works instead propose to match the distributions of the networks' activations (i.e., their features), a strategy known as \emph{distribution matching}. In this paper, we propose a unifying framework, Knowledge Distillation through Distribution Matching (KD$^{2}$M), which formalizes this strategy. Our contributions are threefold: we i) provide an overview of distribution metrics used in distribution matching, ii) benchmark the framework on computer vision datasets, and iii) derive new theoretical results for KD.
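To make the idea concrete, the sketch below shows one possible instantiation of a distribution-matching KD objective, not the paper's exact formulation: the usual softened-prediction matching term is combined with a distribution metric computed between batches of teacher and student features. Here the metric is an RBF-kernel MMD, chosen purely for illustration (the framework admits other metrics, e.g., Wasserstein distances), and the names \texttt{rbf\_mmd} and \texttt{kd2m\_loss} as well as the weights \texttt{alpha}, \texttt{beta}, and temperature \texttt{T} are hypothetical. Teacher and student features are assumed to have matching dimensions (e.g., via a projection head).
\begin{verbatim}
# Minimal sketch of a distribution-matching KD loss (illustrative only;
# the metric, weights, and names below are assumptions, not the paper's).
import torch
import torch.nn.functional as F

def rbf_mmd(x, y, sigma=1.0):
    # Biased MMD^2 estimate between two feature batches with an RBF kernel.
    xx = torch.cdist(x, x) ** 2
    yy = torch.cdist(y, y) ** 2
    xy = torch.cdist(x, y) ** 2
    k = lambda d: torch.exp(-d / (2.0 * sigma ** 2))
    return k(xx).mean() + k(yy).mean() - 2.0 * k(xy).mean()

def kd2m_loss(student_logits, teacher_logits,
              student_feats, teacher_feats,
              labels, T=4.0, alpha=0.5, beta=1.0):
    # Supervised term on the student's own predictions.
    ce = F.cross_entropy(student_logits, labels)
    # Classical KD: match softened output distributions.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T
    # Distribution matching on activations (one possible metric choice).
    dm = rbf_mmd(student_feats, teacher_feats)
    return ce + alpha * kd + beta * dm
\end{verbatim}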