Bayes Conditional Distribution Estimation for Knowledge Distillation Based on Conditional Mutual Information

It is believed that in knowledge distillation (KD), the role of the teacher is to provide an estimate of the unknown Bayes conditional probability distribution (BCPD) for use in training the student. Conventionally, this estimate is obtained by training the teacher with the maximum log-likelihood (MLL) method. To improve this estimate for KD, in this paper we introduce the concept of conditional mutual information (CMI) into the estimation of the BCPD and propose a novel estimator called the maximum CMI (MCMI) method. Specifically, in MCMI estimation, both the log-likelihood and the CMI of the teacher are simultaneously maximized when the teacher is trained. Using Eigen-CAM, we further show that maximizing the teacher's CMI value allows the teacher to capture more contextual information in an image cluster. Through a thorough set of experiments, we show that employing a teacher trained via MCMI estimation rather than one trained via MLL estimation in various state-of-the-art KD frameworks consistently increases the student's classification accuracy, with gains of up to 3.32\%. This suggests that the teacher's BCPD estimate provided by the MCMI method is more accurate than that provided by the MLL method. In addition, we show that these improvements in the student's accuracy are more pronounced in zero-shot and few-shot settings. Notably, the student's accuracy increases by up to 5.72\% when 5\% of the training samples are available to the student (few-shot), and from 0\% to as high as 84\% for an omitted class (zero-shot). The code is available at \url{https://github.com/iclr2024mcmi/ICLRMCMI}.
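To make the idea of maximizing log-likelihood and CMI together more concrete, the following is a minimal sketch and not the implementation from the linked repository. It assumes that the CMI in question is $I(X; \hat{Y} \mid Y)$ and that it can be estimated on a batch as the average KL divergence between each sample's predicted distribution and the mean predicted distribution of its ground-truth class. The function names (`empirical_cmi`, `mcmi_objective`) and the trade-off weight `lam` are hypothetical and chosen only for illustration.

```python
import math


def class_centroids(probs, labels):
    """Mean predicted distribution Q_y over the samples of each class y."""
    sums, counts = {}, {}
    for p, y in zip(probs, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for k, v in enumerate(p):
            acc[k] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def empirical_cmi(probs, labels):
    """Assumed batch estimate of I(X; Y_hat | Y): mean KL(P_x || Q_y)."""
    centroids = class_centroids(probs, labels)
    total = sum(kl_divergence(p, centroids[y]) for p, y in zip(probs, labels))
    return total / len(probs)


def mean_log_likelihood(probs, labels, eps=1e-12):
    """Average log-probability assigned to the ground-truth label."""
    return sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(probs)


def mcmi_objective(probs, labels, lam=0.1):
    """Quantity to maximize: log-likelihood plus lam times the CMI estimate.

    lam is a hypothetical trade-off weight, not a value from the paper.
    """
    return mean_log_likelihood(probs, labels) + lam * empirical_cmi(probs, labels)


# Toy batch: two classes, two samples each, three output classes.
probs = [
    [0.7, 0.2, 0.1], [0.5, 0.1, 0.4],   # samples labeled 0
    [0.1, 0.8, 0.1], [0.3, 0.6, 0.1],   # samples labeled 1
]
labels = [0, 0, 1, 1]
print("log-likelihood:", mean_log_likelihood(probs, labels))
print("CMI estimate:  ", empirical_cmi(probs, labels))
print("objective:     ", mcmi_objective(probs, labels, lam=0.1))
```

Under this assumed estimator, a teacher whose outputs are identical for every sample of a class has a CMI of zero, while a teacher whose outputs vary across the samples of a class has a positive CMI. In an actual training loop, one would minimize the negative of this objective with a differentiable framework; the pure-Python version here only illustrates the quantities involved.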
