Knowledge distillation aims to improve the performance of a lightweight student model by exploiting the knowledge of a pre-trained, cumbersome teacher model. In traditional knowledge distillation, however, teacher predictions supply the supervisory signal only for the last layer of the student model. As a result, shallow student layers may lack accurate training guidance during layer-by-layer backpropagation, which hinders effective knowledge transfer. To address this issue, we propose Deeply-Supervised Knowledge Distillation (DSKD), which fully utilizes the class predictions and feature maps of the teacher model to supervise the training of shallow student layers. DSKD also includes a loss-based weight allocation strategy that adaptively balances the learning process of each shallow layer, further improving student performance. Extensive experiments on CIFAR-100 and TinyImageNet with various teacher-student models show significant performance improvements, confirming the effectiveness of the proposed method. Code is available at: https://github.com/luoshiya/DSKD
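The abstract does not give the exact loss or weighting formulas, so the following is only a rough sketch of how deep supervision with loss-based weighting might look. It assumes each shallow student layer has a hypothetical auxiliary classifier that produces logits, uses a temperature-softened KL divergence against the teacher's predictions as the per-layer loss, and assumes a softmax over the per-layer losses as the weighting rule (so layers that are further from the teacher receive more weight). The feature-map supervision term is omitted, and all function names are illustrative rather than taken from the DSKD repository.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def loss_based_weights(layer_losses):
    """Assumed weighting rule: softmax over the per-layer losses.

    In a real training loop these weights would typically be treated
    as constants (no gradient flows through them).
    """
    return softmax(layer_losses)


def deeply_supervised_kd_loss(teacher_logits, shallow_logits_per_layer,
                              temperature=4.0):
    """Combine per-layer distillation losses using loss-based weights.

    teacher_logits: teacher class logits for one sample.
    shallow_logits_per_layer: one logit list per shallow student layer,
        produced by hypothetical auxiliary classifiers.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    layer_losses = [
        kl_divergence(teacher_probs, softmax(logits, temperature))
        for logits in shallow_logits_per_layer
    ]
    weights = loss_based_weights(layer_losses)
    total = sum(w * l for w, l in zip(weights, layer_losses))
    return total, weights, layer_losses


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    shallow = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
    total, weights, losses = deeply_supervised_kd_loss(teacher, shallow)
    print("per-layer losses:", losses)
    print("weights:", weights)
    print("combined loss:", total)
```

In this sketch, a shallow layer whose predictions already match the teacher contributes a near-zero loss and receives a smaller weight, while a poorly aligned layer is emphasized. Whether DSKD uses this particular rule is not stated in the abstract; the linked repository is the authoritative source.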