Logit knowledge distillation has attracted increasing attention in recent studies because of its practicality. However, it often underperforms feature knowledge distillation. In this paper, we argue that existing logit-based methods may be sub-optimal because they leverage only the global logit output, which couples multiple kinds of semantic knowledge. This may transfer ambiguous knowledge to the student and mislead its learning. To this end, we propose a simple but effective method for logit knowledge distillation, Scale Decoupled Distillation (SDD). SDD decouples the global logit output into multiple local logit outputs and establishes distillation pipelines for them. This helps the student mine and inherit fine-grained, unambiguous logit knowledge. Moreover, the decoupled knowledge can be further divided into consistent and complementary logit knowledge, which transfer semantic information and sample ambiguity, respectively. By increasing the weight of the complementary part, SDD guides the student to focus more on ambiguous samples, improving its discrimination ability. Extensive experiments on several benchmark datasets demonstrate the effectiveness of SDD across a wide range of teacher-student pairs, especially on fine-grained classification tasks. Code is available at: https://github.com/shicaiwei123/SDD-CVPR2024
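The decoupling idea can be illustrated with a small, dependency-free sketch. It is not the authors' implementation (see the repository linked above for that) and it rests on several assumptions that the abstract does not state: local logits are obtained by average pooling an H x W map of per-location logit vectors over a grid partition; a region counts as "consistent" when the teacher's local prediction matches the teacher's global prediction, and as "complementary" otherwise; each region is distilled with a temperature-scaled KL divergence. The names `pool_regions`, `sdd_style_loss`, `scales`, `temperature`, and `beta` are hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - lse for z in scaled]


def kl_divergence(log_p, log_q):
    """KL(p || q) computed from log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def pool_regions(logit_map, grid):
    """Average an H x W map of logit vectors over a grid x grid partition.

    Assumes H >= grid and W >= grid. grid=1 yields the single global output.
    """
    h, w = len(logit_map), len(logit_map[0])
    regions = []
    for gi in range(grid):
        for gj in range(grid):
            rows = range(gi * h // grid, (gi + 1) * h // grid)
            cols = range(gj * w // grid, (gj + 1) * w // grid)
            cells = [logit_map[r][c] for r in rows for c in cols]
            regions.append([sum(v) / len(cells) for v in zip(*cells)])
    return regions


def sdd_style_loss(teacher_map, student_map, scales=(1, 2),
                   temperature=4.0, beta=2.0):
    """Sum per-region KD losses, up-weighting complementary regions by beta.

    Assumption (not from the abstract): a region is complementary when the
    teacher's local prediction differs from the teacher's global prediction.
    """
    global_class = argmax(pool_regions(teacher_map, 1)[0])
    consistent, complementary = 0.0, 0.0
    for grid in scales:
        teacher_regions = pool_regions(teacher_map, grid)
        student_regions = pool_regions(student_map, grid)
        for t, s in zip(teacher_regions, student_regions):
            kd = temperature ** 2 * kl_divergence(
                log_softmax(t, temperature), log_softmax(s, temperature))
            if argmax(t) == global_class:
                consistent += kd
            else:
                complementary += kd
    return consistent + beta * complementary


# Toy 2 x 2 logit maps with two classes. The teacher's bottom-right cell
# favors class 1 while its global average favors class 0.
teacher_map = [[[4.0, 0.0], [4.0, 0.0]],
               [[4.0, 0.0], [0.0, 2.0]]]
student_map = [[[1.0, 1.0], [1.0, 1.0]],
               [[1.0, 1.0], [1.0, 1.0]]]
loss = sdd_style_loss(teacher_map, student_map)
```

In this toy input the bottom-right region falls into the complementary group, so raising `beta` increases its share of the total loss, which mirrors the abstract's point about focusing the student on ambiguous samples.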