In this work, we propose Mutual Information Maximization Knowledge Distillation (MIMKD). Our method uses a contrastive objective to simultaneously estimate and maximize a lower bound on the mutual information between local and global feature representations of a teacher and a student network. Extensive experiments show that this objective improves the performance of low-capacity models by transferring knowledge from more accurate but computationally expensive models, yielding better models for devices with limited computational resources. The method is flexible: it can distill knowledge from teachers with arbitrary network architectures into arbitrary student networks. Empirically, MIMKD outperforms competing approaches across a wide range of student-teacher pairs, including pairs with different capacities, pairs with different architectures, and students with extremely low capacity. On CIFAR100, distilling from a ResNet-50 teacher raises the accuracy of a ShufflenetV2 student from a 69.8% baseline to 74.55%. On ImageNet, a ResNet-34 teacher improves a ResNet-18 student from 68.88% to 70.32% accuracy (+1.44 percentage points).
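To make the idea of a contrastive lower bound on mutual information concrete, the following is a minimal sketch and not the estimator used in this work. The abstract does not say which bound MIMKD uses; the sketch assumes the InfoNCE bound with a dot-product critic as one common choice. The names `info_nce_lower_bound`, `teacher_feats`, `student_feats`, and `temperature` are hypothetical and chosen only for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_lower_bound(teacher_feats, student_feats, temperature=1.0):
    """InfoNCE estimate of a lower bound on I(teacher; student).

    Assumption (not from the paper): the critic is a scaled dot product.
    teacher_feats[i] and student_feats[i] come from the same input
    (positive pair); all other pairings in the batch act as negatives.
    The returned value never exceeds log(batch size).
    """
    n = len(teacher_feats)
    total = 0.0
    for i in range(n):
        # Critic scores of student feature i against every teacher feature.
        scores = [dot(student_feats[i], t) / temperature for t in teacher_feats]
        # Numerically stable log-sum-exp over the batch.
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        # Log-probability of picking the matching teacher feature.
        total += scores[i] - log_norm
    return math.log(n) + total / n


def distillation_loss(teacher_feats, student_feats, temperature=1.0):
    """Minimizing this loss maximizes the InfoNCE bound."""
    return -info_nce_lower_bound(teacher_feats, student_feats, temperature)


if __name__ == "__main__":
    teacher = [[4.0, 0.0], [0.0, 4.0]]
    aligned = [[4.0, 0.0], [0.0, 4.0]]   # student agrees with teacher
    shuffled = [[0.0, 4.0], [4.0, 0.0]]  # student matches the wrong samples
    print(info_nce_lower_bound(teacher, aligned))
    print(info_nce_lower_bound(teacher, shuffled))
```

In this toy setting, a student whose features align with the teacher's on matching inputs scores a bound close to its ceiling of log N, while a student matched to the wrong samples scores far lower. In a real training loop the loss would be added to the task loss and differentiated with respect to the student parameters, which this dependency-free sketch does not attempt.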