It is believed that in knowledge distillation (KD), the role of the teacher is to provide an estimate of the unknown Bayes conditional probability distribution (BCPD) to be used in the student training process. Conventionally, this estimate is obtained by training the teacher using the maximum log-likelihood (MLL) method. To improve this estimate for KD, in this paper we introduce the concept of conditional mutual information (CMI) into BCPD estimation and propose a novel estimator called the maximum CMI (MCMI) method. Specifically, in MCMI estimation, both the log-likelihood and the CMI of the teacher are simultaneously maximized when the teacher is trained. Through Eigen-CAM, it is further shown that maximizing the teacher's CMI value allows the teacher to capture more contextual information in an image cluster. Through a thorough set of experiments, we show that employing a teacher trained via MCMI estimation rather than one trained via MLL estimation in various state-of-the-art KD frameworks consistently increases the student's classification accuracy, with gains of up to 3.32\%. This suggests that the teacher's BCPD estimate provided by the MCMI method is more accurate than that provided by the MLL method. In addition, we show that such improvements in the student's accuracy are more drastic in zero-shot and few-shot settings. Notably, the student's accuracy increases by up to 5.72\% when 5\% of the training samples are available to the student (few-shot), and increases from 0\% to as high as 84\% for an omitted class (zero-shot). The code is available at \url{https://github.com/iclr2024mcmi/ICLRMCMI}.
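The combined objective described above, maximizing log-likelihood and CMI simultaneously, can be illustrated with a minimal sketch. This is not the paper's exact formulation: the CMI estimator and the trade-off weight `lam` are hypothetical placeholders, and in practice the CMI term would be estimated from the teacher's outputs.

```python
import math

def cross_entropy(probs, target):
    # Negative log-likelihood of the target class; minimizing this
    # is equivalent to maximizing the teacher's log-likelihood.
    return -math.log(probs[target])

def mcmi_loss(probs, target, cmi_estimate, lam=1.0):
    # Hypothetical sketch of an MCMI-style objective: minimize
    # cross-entropy minus a weighted CMI term, so that gradient
    # descent jointly maximizes log-likelihood and CMI.
    # `cmi_estimate` stands in for a real CMI estimator.
    return cross_entropy(probs, target) - lam * cmi_estimate
```

A larger `cmi_estimate` lowers the loss, so training is pushed toward teachers with higher CMI as well as higher likelihood.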