Knowledge distillation aims to learn a lightweight student network from a pre-trained teacher network. In practice, existing knowledge distillation methods are usually infeasible when the original training data is unavailable due to privacy concerns or data management considerations. Data-free knowledge distillation approaches have therefore been proposed that collect training instances from the Internet. However, most of them ignore the distribution shift that commonly exists between the original training data and the web-collected data, which undermines the reliability of the trained student network. To solve this problem, we propose a novel method dubbed ``Knowledge Distillation between Different Distributions'' (KD$^{3}$), which consists of three components. First, we dynamically select useful training instances from the web-collected data according to the combined predictions of the teacher and student networks. Second, we align both the weighted features and the classifier parameters of the two networks for knowledge memorization. Third, we build a new contrastive learning block called MixDistribution that generates perturbed data with a new distribution for instance alignment, so that the student network can further learn a distribution-invariant representation. Extensive experiments on various benchmark datasets demonstrate that the proposed KD$^{3}$ outperforms state-of-the-art data-free knowledge distillation approaches.
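The abstract does not spell out the selection rule, so the following is only a minimal sketch of what selecting instances by the combined predictions of the two networks could look like. It assumes (this is not stated in the source) that the combined prediction is the average of the teacher and student softmax outputs, that an instance is scored by its maximum combined class probability, and that a fixed fraction of the highest-scoring instances is kept. The names `select_instances` and `keep_ratio` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_instances(teacher_logits, student_logits, keep_ratio=0.5):
    """Return sorted indices of the web-collected instances to keep.

    ASSUMPTION (not from the paper): the score of an instance is the
    largest class probability of the averaged teacher/student softmax,
    and the top `keep_ratio` fraction of instances is retained.
    """
    scores = []
    for t, s in zip(teacher_logits, student_logits):
        p_teacher, p_student = softmax(t), softmax(s)
        # Combine the two predictions by simple averaging (assumed rule).
        combined = [(a + b) / 2 for a, b in zip(p_teacher, p_student)]
        scores.append(max(combined))
    n_keep = max(1, int(len(scores) * keep_ratio))
    # Rank instances from most to least confident under the combined prediction.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy logits for four web-collected instances and three classes.
teacher = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 4.0, 0.0], [1.0, 1.0, 1.0]]
student = [[4.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0], [1.0, 1.0, 1.0]]
kept = select_instances(teacher, student, keep_ratio=0.5)
```

Because the student changes during training, re-running such a selection every epoch would make the chosen subset dynamic, which is the property the abstract emphasizes; the actual criterion used by KD$^{3}$ may differ.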