Existing Knowledge Distillation (KD) methods typically align feature information between teacher and student by designing feature transformations and loss functions. However, because the teacher and student have different feature distributions, the student may absorb incompatible information from the teacher. To address this problem, we propose teacher-guided student Diffusion Self-KD, dubbed DSKD. Instead of direct teacher-student alignment, we leverage the teacher classifier to guide the sampling process that denoises student features through a lightweight diffusion model. We then propose a novel locality-sensitive hashing (LSH)-guided feature distillation method between the original and denoised student features. The denoised student features encapsulate teacher knowledge and can thus play the role of a teacher. In this way, DSKD eliminates discrepancies in mapping manners and feature distributions between the teacher and student while still learning meaningful knowledge from the teacher. Experiments on visual recognition tasks demonstrate that DSKD significantly outperforms existing KD methods across various models and datasets. Our code is included in the supplementary material.
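The LSH-guided distillation between original and denoised student features can be illustrated with a minimal sketch. This is not the paper's implementation: the random-sign-projection hash family, the tanh relaxation (used so the codes stay differentiable), and the names `lsh_codes` / `lsh_distill_loss` are all assumptions made for illustration; the denoised features stand in for the output of the teacher-guided diffusion sampler.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def lsh_codes(feats, proj):
    # Relaxed LSH codes: sign of random projections, softened with tanh
    # so gradients can flow to the student features. (Illustrative choice.)
    return torch.tanh(feats @ proj)

def lsh_distill_loss(student_feats, denoised_feats, proj):
    # Align the LSH codes of the original student features with those of
    # the denoised (teacher-guided) student features, which act as the
    # teacher role. The denoised branch is detached as the target.
    return F.mse_loss(lsh_codes(student_feats, proj),
                      lsh_codes(denoised_feats.detach(), proj))

# Toy usage: 4 samples, 16-dim features, 8 hash bits.
d, k = 16, 8
proj = torch.randn(d, k)                      # shared random projection matrix
student = torch.randn(4, d, requires_grad=True)
denoised = torch.randn(4, d)                  # placeholder for diffusion output
loss = lsh_distill_loss(student, denoised, proj)
loss.backward()                               # gradients reach the student only
```

Hashing both branches with the same projections compares features in a shared code space rather than raw activations, which is one way such a loss can sidestep distribution mismatch between the two feature sets.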