Temperature plays a pivotal role in moderating label softness in knowledge distillation (KD). Traditional approaches often use a static temperature throughout the KD process, which neither accounts for samples of varying difficulty nor reflects the distinct capabilities of different teacher-student pairings. The result is suboptimal knowledge transfer. To improve knowledge propagation, we propose Dynamic Temperature Knowledge Distillation (DTKD), which introduces dynamic, cooperative temperature control for the teacher and student models simultaneously within each training iteration. In particular, we propose "sharpness" as a metric to quantify the smoothness of a model's output distribution. By minimizing the sharpness difference between the teacher and the student, we derive sample-specific temperatures for each of them. Extensive experiments on CIFAR-100 and ImageNet-2012 demonstrate that DTKD performs comparably to leading KD techniques, with added robustness in Target Class KD and Non-target Class KD scenarios. The code is available at https://github.com/JinYu1998/DTKD.
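The abstract does not give the formula for sharpness or for the derived temperatures, so the following is only a minimal sketch of the idea under stated assumptions, not the authors' implementation. It assumes that sharpness is approximated by a sample's largest logit, that both largest logits are positive, and that the two temperatures are constrained to sum to twice a base temperature. Under those assumptions, equalizing the sharpness of the temperature-scaled teacher and student outputs gives a closed-form pair of temperatures. The names `sharpness`, `dynamic_temperatures`, and `base_temp` are hypothetical and are not taken from the linked repository.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer labels."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sharpness(logits):
    """ASSUMPTION: use the largest logit as a proxy for sharpness."""
    return max(logits)


def dynamic_temperatures(teacher_logits, student_logits, base_temp=4.0):
    """Return (teacher_temp, student_temp) for one sample.

    ASSUMPTION: we want sharpness(teacher / T_t) == sharpness(student / T_s)
    subject to T_t + T_s == 2 * base_temp. With the max-logit proxy this
    has the closed-form solution below.
    """
    x = sharpness(teacher_logits)
    y = sharpness(student_logits)
    if x <= 0 or y <= 0:
        raise ValueError("this sketch assumes positive maximum logits")
    teacher_temp = 2.0 * x / (x + y) * base_temp
    student_temp = 2.0 * y / (x + y) * base_temp
    return teacher_temp, student_temp


# Example: a confident teacher and a less confident student.
teacher = [8.0, 2.0, 1.0]
student = [4.0, 1.0, 0.5]
t_teacher, t_student = dynamic_temperatures(teacher, student, base_temp=4.0)
soft_teacher = softmax(teacher, t_teacher)
soft_student = softmax(student, t_student)
```

In this sketch the sharper model (here the teacher) receives the higher temperature and the flatter model the lower one, so their scaled maximum logits coincide. A static-temperature baseline would instead apply `base_temp` to both models for every sample.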