Knowledge distillation (KD) is a general neural network training approach that uses a teacher model to guide a student model. Existing works mainly study KD from the network output side (e.g., by designing a better KD loss function), while few have attempted to understand it from the input side. In particular, its interplay with data augmentation (DA) is not well understood. In this paper, we ask: Why do some DA schemes (e.g., CutMix) inherently perform much better than others in KD? What makes a "good" DA in KD? Our investigation from a statistical perspective suggests that a good DA scheme should reduce the covariance of the teacher-student cross-entropy. We further present a practical metric, the standard deviation of the teacher's mean probability (T. stddev), and justify it empirically. Beyond this theoretical understanding, we also introduce a new entropy-based data-mixing DA scheme, CutMixPick, to further enhance CutMix. Extensive empirical studies support our claims and demonstrate that considerable performance gains can be obtained simply by using a better DA scheme in knowledge distillation.
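To make the two quantities named above concrete, the following is a minimal sketch, not the paper's implementation. The teacher-student cross-entropy uses its standard definition, H(p_t, p_s) = -Σ_k p_t[k] log p_s[k]. The abstract does not define T. stddev precisely, so `t_stddev_proxy` encodes one assumed reading: average the teacher's probability on the ground-truth class over several augmented views of each sample, then take the standard deviation of those means across samples. All function names, the `temperature` parameter, and the nested-list input layout are hypothetical.

```python
import math
from statistics import pstdev


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_total = math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - peak - log_total for z in scaled]


def softmax(logits, temperature=1.0):
    """Softmax probabilities derived from the stable log-softmax."""
    return [math.exp(v) for v in log_softmax(logits, temperature)]


def teacher_student_cross_entropy(teacher_logits, student_logits, temperature=1.0):
    """Standard cross-entropy H(p_t, p_s) for a single input."""
    p_t = softmax(teacher_logits, temperature)
    log_p_s = log_softmax(student_logits, temperature)
    return -sum(t * ls for t, ls in zip(p_t, log_p_s))


def t_stddev_proxy(teacher_logits_per_view, labels, temperature=1.0):
    """ASSUMED reading of "T. stddev"; the abstract gives no formula.

    teacher_logits_per_view[i][v] holds the teacher logits for augmented
    view v of sample i; labels[i] is that sample's ground-truth class.
    """
    mean_probs = []
    for views, label in zip(teacher_logits_per_view, labels):
        # Teacher probability on the true class, averaged over views.
        probs = [softmax(view, temperature)[label] for view in views]
        mean_probs.append(sum(probs) / len(probs))
    # Population standard deviation of the per-sample means.
    return pstdev(mean_probs)
```

Under this reading, one would compute `t_stddev_proxy` for each candidate DA scheme using the same teacher and the same samples, then compare the values across schemes. Whether a lower value tracks better student accuracy is the paper's empirical claim, not something this sketch establishes.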