Knowledge distillation (KD) transfers knowledge from an overparameterized teacher network to a student network with fewer parameters, minimizing the performance loss incurred by the reduction in model size. KD methods can be categorized into offline and online approaches. Offline KD leverages a powerful pretrained teacher network, whereas online KD allows the teacher network to be adjusted dynamically to improve how effectively the student network learns. Recently, it has been shown that sharing the classifier of the teacher network can significantly boost the performance of the student network with only a minimal increase in the number of network parameters. Building on these insights, we propose adaptive teaching with a shared classifier (ATSC). In ATSC, the pretrained teacher network adjusts itself to match the capabilities and learning needs of the student network, while the student network improves its performance by using the shared classifier. We also extend ATSC to settings with multiple teachers. Extensive experiments demonstrate the effectiveness of the proposed KD method. Our approach achieves state-of-the-art results on the CIFAR-100 and ImageNet datasets in both single-teacher and multi-teacher scenarios, with only a modest increase in the number of required model parameters. The source code is publicly available at https://github.com/random2314235/ATSC.
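As a rough illustration of the two ingredients the abstract combines, the sketch below shows a generic temperature-softened distillation loss and a student that reuses the teacher's classifier. It is not the ATSC implementation from the linked repository. The projector layer, the toy dimensions, the weight values, the temperature of 4.0, and all function names are assumptions made for this example only, and the adaptive adjustment of the teacher is not modeled.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by `temperature`."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by temperature**2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

# Hypothetical toy setup: 2-d student features, 3-d teacher features, 3 classes.
teacher_classifier_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
teacher_classifier_b = [0.0, 0.0, 0.0]

# Assumed projector that maps student features to the teacher's feature size,
# so that the teacher's classifier can be applied to them unchanged.
projector_w = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
projector_b = [0.0, 0.0, 0.0]

def student_predict(student_features):
    """Student logits computed with the shared (teacher) classifier."""
    projected = linear(projector_w, projector_b, student_features)
    return linear(teacher_classifier_w, teacher_classifier_b, projected)

teacher_logits = linear(teacher_classifier_w, teacher_classifier_b, [2.0, 1.0, 0.5])
student_logits = student_predict([1.5, 1.0])
loss = kd_loss(student_logits, teacher_logits)
```

In this sketch the only parameters added on the student side are those of the projector, which is one plausible reading of how classifier sharing keeps the parameter increase small.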