Traditional knowledge distillation (KD) relies on a proficient teacher trained on the target task, which is not always available. In this setting, cross-task distillation can be used instead, enabling the use of any teacher model trained on a different task. However, many KD methods prove ineffective when applied in this cross-task setting. To address this limitation, we propose a simple modification: the use of an inverted projection. This drop-in replacement for a standard projector is effective because it learns to disregard task-specific features that might degrade the student's performance. We find that this simple modification suffices to extend many KD methods to the cross-task setting, even when the teacher and student tasks differ substantially. In doing so, we obtain up to a 1.9% improvement over the traditional projection in the cross-task setting, at no additional cost. Our method yields significant performance improvements (up to 7%) on tasks such as depth estimation, image translation, and semantic segmentation, even when using a randomly initialised teacher, despite the absence of any learned knowledge to transfer. To provide conceptual and analytical insight into this result, we show that using an inverted projection allows the distillation loss to be decomposed into a knowledge-transfer component and a spectral-regularisation component. Building on this analysis, we further propose a novel regularisation loss that enables teacher-free distillation, yielding performance improvements of up to 8.57% on ImageNet with no additional training costs.
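The central change can be made concrete with a short sketch. Below is a minimal PyTorch illustration, assuming the inverted projection simply reverses the direction of the usual feature projector (mapping teacher features into the student's space rather than student features into the teacher's); the class names, dimensions, and the plain linear/MSE choices are ours for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StandardProjector(nn.Module):
    """Conventional feature distillation: project the *student* features
    into the teacher's feature space before matching."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
        # Match projected student features against raw teacher features.
        return F.mse_loss(self.proj(f_student), f_teacher)

class InvertedProjector(nn.Module):
    """Inverted projection: project the *teacher* features into the
    student's feature space instead. The projector can then learn to
    suppress task-specific teacher features before they reach the loss."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(teacher_dim, student_dim)

    def forward(self, f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
        # Match raw student features against projected teacher features.
        return F.mse_loss(f_student, self.proj(f_teacher))

# Illustrative usage with random features (batch of 8, d_s=256, d_t=512).
if __name__ == "__main__":
    f_s, f_t = torch.randn(8, 256), torch.randn(8, 512)
    distill_loss = InvertedProjector(256, 512)(f_s, f_t)
    print(distill_loss.item())
```

Because the two projectors have the same interface, swapping one for the other leaves the rest of a distillation pipeline untouched, which is what makes the inverted projection a drop-in replacement.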
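One way to see the claimed decomposition, assuming a linear inverted projection $W \in \mathbb{R}^{d_t \times d_s}$ and an L2 feature-matching loss over student features $F_s \in \mathbb{R}^{n \times d_s}$ and teacher features $F_t \in \mathbb{R}^{n \times d_t}$ (our notation and expansion, not necessarily the paper's exact derivation):

$$
\|F_s - F_t W\|_F^2
= \underbrace{\|F_s\|_F^2}_{\text{spectral regularisation}}
\;\underbrace{-\,2\,\operatorname{tr}\!\left(F_s^{\top} F_t W\right)}_{\text{knowledge transfer}}
\;+\;\underbrace{\|F_t W\|_F^2}_{\text{independent of the student}}
$$

Since $\|F_s\|_F^2 = \sum_i \sigma_i(F_s)^2$ is the sum of squared singular values of the student features, the first term acts on the student's feature spectrum regardless of the teacher, which is consistent with the observation that even a randomly initialised teacher can yield improvements.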