Knowledge distillation (KD) is a powerful model compression technique broadly used in practical deep learning applications. It trains a small student network to mimic a larger teacher network. While it is widely known that KD can improve student generalization in the i.i.d. setting, its performance under domain shift, i.e., the performance of student networks on data from domains unseen during training, has received little attention in the literature. In this paper, we take a step towards bridging the research fields of knowledge distillation and domain generalization. We show that weight averaging techniques proposed in the domain generalization literature, such as SWAD and SMA, also improve the performance of knowledge distillation under domain shift. In addition, we propose a simple weight averaging strategy that does not require evaluation on validation data during training and show that it performs on par with SWAD and SMA when applied to KD. We name our final distillation approach Weight-Averaged Knowledge Distillation (WAKD).
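The abstract combines two ingredients: a distillation loss that pulls the student towards the teacher, and an average of the student's weights taken along the training trajectory. The sketch below illustrates both in plain Python. It is not the authors' implementation: the abstract does not specify the WAKD loss or averaging schedule, so the temperature-softened KL loss, the uniform running mean, and the names `kd_loss`, `RunningWeightAverage`, and `start_step` are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: a common form of the distillation loss, not necessarily
    the exact loss used in the paper.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


class RunningWeightAverage:
    """Uniform running mean of student weight vectors, one update per step.

    Assumption: averaging starts at a fixed, hypothetical `start_step`,
    so no validation data is consulted during training.
    """

    def __init__(self, start_step=0):
        self.start_step = start_step
        self.count = 0
        self.average = None

    def update(self, weights, step):
        if step < self.start_step:
            return  # skip steps before averaging begins
        self.count += 1
        if self.average is None:
            self.average = list(weights)
        else:
            # incremental mean: a_new = a + (w - a) / n
            self.average = [
                a + (w - a) / self.count for a, w in zip(self.average, weights)
            ]


# Toy usage: three "training steps" with two-parameter students.
averager = RunningWeightAverage(start_step=1)
for step, weights in enumerate([[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]):
    averager.update(weights, step)
print(averager.average)  # [2.0, 3.0]
```

In a real training loop, the student would be updated by gradient steps on `kd_loss` (possibly mixed with a supervised loss), and the averaged weights rather than the final iterate would be evaluated on the unseen domains.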