Learning to Project for Cross-Task Knowledge Distillation

Traditional knowledge distillation (KD) relies on a proficient teacher trained on the target task, which is not always available. In this setting, cross-task distillation can be used, enabling the use of any teacher model trained on a different task. However, many KD methods prove ineffective when applied to this cross-task setting. To address this limitation, we propose a simple modification: the use of an inverted projection. We show that this drop-in replacement for a standard projector is effective by learning to disregard any task-specific features which might degrade the student's performance. We find that this simple modification is sufficient for extending many KD methods to the cross-task setting, where the teacher and student tasks can be very different. In doing so, we obtain up to a 1.9% improvement in the cross-task setting compared to the traditional projection, at no additional cost. Our method can obtain significant performance improvements (up to 7%) when using even a randomly-initialised teacher on various tasks such as depth estimation, image translation, and semantic segmentation, despite the lack of any learned knowledge to transfer. To provide conceptual and analytical insights into this result, we show that using an inverted projection allows the distillation loss to be decomposed into a knowledge transfer and a spectral regularisation component. Through this analysis we are additionally able to propose a novel regularisation loss that allows teacher-free distillation, enabling performance improvements of up to 8.57% on ImageNet with no additional training costs.
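To make the difference between the two projector placements concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a standard projector maps student features into the teacher's feature space, and that the inverted projection instead maps teacher features into the student's space. The abstract does not spell out this definition, so treat it as an assumption. The function names, the toy feature matrices, and the fixed (not learned) projection weights are all hypothetical and chosen only for illustration; in practice the projector would be trained jointly with the student.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product: a is (n x k), b is (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def mse(a: Matrix, b: Matrix) -> float:
    """Mean squared error between two matrices of the same shape."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def standard_projection_loss(z_s: Matrix, z_t: Matrix, p_s: Matrix) -> float:
    """Conventional feature distillation: project the student into teacher space."""
    return mse(matmul(z_s, p_s), z_t)


def inverted_projection_loss(z_s: Matrix, z_t: Matrix, p_t: Matrix) -> float:
    """Assumed inverted form: project the teacher into student space."""
    return mse(z_s, matmul(z_t, p_t))


# Toy batch of two samples. The teacher has three feature dimensions; the
# third is treated here as a hypothetical task-specific feature that is
# irrelevant to the student's task.
z_t = [[1, 2, 9], [3, 4, -7]]
z_s = [[1, 2], [3, 4]]

# Hand-picked weights (assumption): the inverted projector zeroes out the
# third teacher dimension, while the standard projector pads the student
# features with a zero in that dimension.
p_t = [[1, 0], [0, 1], [0, 0]]
p_s = [[1, 0, 0], [0, 1, 0]]

inverted = inverted_projection_loss(z_s, z_t, p_t)
standard = standard_projection_loss(z_s, z_t, p_s)
```

In this toy case the inverted projector can discard the third teacher dimension outright, so its loss is zero. The standard projector is still penalised for failing to reproduce that dimension from the student features. This mirrors, in a very reduced form, the abstract's claim that the inverted projection can learn to disregard task-specific teacher features.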
