A typical technique in knowledge distillation (KD) regularizes the learning of a limited-capacity model (student) by pushing its responses to match those of a powerful model (teacher). Although this is useful, especially in the penultimate layer and beyond, its action on the student's feature transform is rather implicit, which limits its practice in the intermediate layers. To explicitly embed the teacher's knowledge in the feature transform, we propose a learnable KD layer for the student that improves KD with two distinct abilities: i) learning how to leverage the teacher's knowledge, which enables it to discard nuisance information, and ii) feeding the transferred knowledge forward to deeper layers. Thus, the student benefits from the teacher's knowledge during inference as well as training. Formally, we repurpose a 1x1-BN-ReLU-1x1 convolution block to assign a semantic vector to each local region according to the template (supervised by the teacher) that the corresponding region of the student matches. To facilitate template learning in the intermediate layers, we propose a novel form of supervision based on the teacher's decisions. Through rigorous experimentation, we demonstrate the effectiveness of our approach on three popular classification benchmarks. Code is available at: https://github.com/adagorgun/letKD-framework
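As an illustration of the 1x1-BN-ReLU-1x1 block described above, the following is a minimal, dependency-free sketch of one possible reading: the first 1x1 convolution scores each local region against a set of templates, BN and ReLU turn the scores into non-negative gates, and the second 1x1 convolution mixes one semantic vector per template according to those gates. It is not the authors' implementation. All names (`conv1x1`, `batch_norm`, `kd_layer`, `templates`, `semantic_vectors`) are hypothetical, BN is simplified to batch statistics without a learned scale or shift, and the teacher-based supervision of the templates is not shown.

```python
import math
import random


def conv1x1(features, weight):
    """1x1 convolution: the same linear map applied at every location.

    features: list of per-location vectors, each of length C_in.
    weight:   list of C_out rows, each of length C_in.
    """
    return [[sum(w * x for w, x in zip(row, f)) for row in weight]
            for f in features]


def batch_norm(features, eps=1e-5):
    """Normalize each channel over all locations (assumption: no learned affine)."""
    n = len(features)
    channels = range(len(features[0]))
    means = [sum(f[c] for f in features) / n for c in channels]
    variances = [sum((f[c] - means[c]) ** 2 for f in features) / n
                 for c in channels]
    return [[(f[c] - means[c]) / math.sqrt(variances[c] + eps) for c in channels]
            for f in features]


def relu(features):
    return [[max(0.0, v) for v in f] for f in features]


def kd_layer(features, templates, semantic_vectors):
    """Hypothetical sketch of the block for N flattened local regions.

    templates:        K rows of length C_in (weights of the first 1x1 conv).
    semantic_vectors: K rows of length C_out, one vector per template.
    """
    scores = conv1x1(features, templates)   # N x K template match scores
    gates = relu(batch_norm(scores))        # N x K non-negative gates
    # The second 1x1 conv has a C_out x K weight, i.e. the semantic
    # vectors transposed, so each output is a gated sum of those vectors.
    mixing_weight = [list(column) for column in zip(*semantic_vectors)]
    return conv1x1(gates, mixing_weight)    # N x C_out


random.seed(0)
num_locations, c_in, num_templates, c_out = 6, 4, 3, 5
features = [[random.gauss(0, 1) for _ in range(c_in)]
            for _ in range(num_locations)]
templates = [[random.gauss(0, 1) for _ in range(c_in)]
             for _ in range(num_templates)]
semantic_vectors = [[random.gauss(0, 1) for _ in range(c_out)]
                    for _ in range(num_templates)]

out = kd_layer(features, templates, semantic_vectors)
print(len(out), len(out[0]))  # 6 5
```

Under this reading, a region whose normalized score for a template is non-positive receives no contribution from that template's semantic vector, which is one way the layer could discard nuisance information while still passing the selected vectors on to deeper layers.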