Knowledge distillation (KD) has shown very promising capabilities in transferring learned representations from large models (teachers) to small models (students). However, as the capacity gap between students and teachers grows, existing KD methods fail to achieve better results. Our work shows that `prior knowledge' is vital to KD, especially when applying large teachers. In particular, we propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation. This means that our method also takes the teacher's features as `input', not just as `target'. In addition, we dynamically adjust the ratio of the prior knowledge during training according to the feature gap, thus guiding the student at an appropriate difficulty. To evaluate the proposed method, we conduct extensive experiments on two image classification benchmarks (i.e., CIFAR100 and ImageNet) and an object detection benchmark (i.e., MS COCO). The results demonstrate the superiority of our method under varying settings. Moreover, DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers. More importantly, DPK provides a fast solution for teacher model selection for any given model. Our code will be released at \url{https://github.com/Cuibaby/DPK}.
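To make the idea of using teacher features as `input' more concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify how the feature gap is measured, how the positions receiving prior knowledge are chosen, or in which direction the ratio moves with the gap. The sketch therefore assumes a relative squared error as the gap, a uniformly random choice of token positions, and a ratio that grows with the gap. The names `feature_gap` and `mix_prior` are hypothetical.

```python
import numpy as np


def feature_gap(student, teacher):
    """Hypothetical gap measure: relative squared error, clipped to [0, 1]."""
    num = np.sum((student - teacher) ** 2)
    den = np.sum(teacher ** 2) + 1e-12  # guard against division by zero
    return float(np.clip(num / den, 0.0, 1.0))


def mix_prior(student, teacher, ratio, rng):
    """Replace a `ratio` fraction of student tokens with teacher tokens.

    student, teacher: arrays of shape (n_tokens, dim).
    Returns the mixed features and the boolean mask of replaced tokens.
    """
    n_tokens = student.shape[0]
    n_prior = int(round(ratio * n_tokens))
    # Assumption: positions are drawn uniformly at random without replacement.
    idx = rng.choice(n_tokens, size=n_prior, replace=False)
    mask = np.zeros(n_tokens, dtype=bool)
    mask[idx] = True
    mixed = np.where(mask[:, None], teacher, student)
    return mixed, mask


rng = np.random.default_rng(0)
teacher = rng.normal(size=(16, 8))                   # toy teacher features
student = teacher + 0.5 * rng.normal(size=(16, 8))   # toy student features

# Assumption: a larger gap gives the student more prior knowledge.
ratio = feature_gap(student, teacher)
mixed, mask = mix_prior(student, teacher, ratio, rng)
```

In a real pipeline the mixed features would presumably pass through further student layers before being compared with the teacher's features. That step is omitted here, since a loss taken directly on `mixed` would be trivially zero at the replaced positions.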