Knowledge distillation (KD) has shown very promising capabilities in transferring learning representations from large models (teachers) to small models (students). However, as the capacity gap between students and teachers grows, existing KD methods fail to achieve better results. Our work shows that 'prior knowledge' is vital to KD, especially when applying large teachers. Specifically, we propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation. Our method therefore takes the teacher's features as 'input', not just as 'target'. In addition, we dynamically adjust the ratio of prior knowledge during training according to the feature gap, guiding the student at an appropriate difficulty. To evaluate the proposed method, we conduct extensive experiments on two image classification benchmarks (CIFAR100 and ImageNet) and one object detection benchmark (MS COCO). The results demonstrate the superiority of our method under varying settings. Moreover, DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers. More importantly, DPK provides a fast solution to teacher model selection for any given model.
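To make the idea of using teacher features as 'input' as well as 'target' more concrete, the following is a minimal NumPy sketch, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the relative-squared-error gap measure, the rule that sets the prior-knowledge ratio equal to that gap, the per-position random mask, and the linear head `W` standing in for whatever module processes the mixed features. The abstract does not specify any of these details.

```python
import numpy as np


def feature_gap(student, teacher):
    """Hypothetical gap measure in [0, 1]: relative squared error, clipped at 1."""
    num = np.sum((student - teacher) ** 2)
    den = np.sum(teacher ** 2) + 1e-12
    return float(min(1.0, num / den))


def mix_prior(student, teacher, ratio, rng):
    """Replace a random fraction (about `ratio`) of positions with teacher features.

    Both inputs have shape (N, D): N feature positions, D channels.
    """
    mask = rng.random(student.shape[0]) < ratio      # True -> use the teacher feature
    mixed = np.where(mask[:, None], teacher, student)
    return mixed, mask


def dpk_style_loss(student, teacher, W, rng):
    """Sketch of one distillation step under the assumptions stated above.

    The ratio of prior knowledge follows the current feature gap (assumed rule):
    a larger gap means more teacher features are given to the student as input.
    """
    ratio = feature_gap(student, teacher)
    mixed, mask = mix_prior(student, teacher, ratio, rng)
    predicted = mixed @ W                            # hypothetical linear head
    loss = float(np.mean((predicted - teacher) ** 2))
    return loss, ratio, mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(16, 8))
    student = teacher + 0.5 * rng.normal(size=(16, 8))
    W = np.eye(8)
    loss, ratio, mask = dpk_style_loss(student, teacher, W, rng)
    print(f"ratio={ratio:.3f}, teacher positions used={int(mask.sum())}, loss={loss:.4f}")
```

In this sketch, a student whose features already match the teacher receives no prior knowledge (ratio 0), while a student far from the teacher receives mostly teacher features as input, which is one simple way to read "guiding the student at an appropriate difficulty".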