Knowledge distillation (KD) is a simple and successful method for transferring knowledge from a teacher model to a student model based solely on functional activity. However, current KD has several shortcomings: it has recently been shown to be unsuitable for transferring simple inductive biases such as shift equivariance, it struggles to transfer out-of-domain generalization, and its optimization time is orders of magnitude longer than that of default non-KD model training. To improve these aspects of KD, we propose Hard Augmentations for Robust Distillation (HARD), a generally applicable data augmentation framework that generates synthetic data points on which the teacher and the student disagree. We show in a simple toy example that our augmentation framework solves the problem of transferring simple equivariances with KD. We then apply our framework to real-world tasks with a variety of augmentation models, ranging from simple spatial transformations to unconstrained image manipulations with a pretrained variational autoencoder. We find that our learned augmentations significantly improve KD performance in both in-domain and out-of-domain evaluation. Moreover, our method outperforms even state-of-the-art data augmentations, and since the augmented training inputs can be visualized, they offer qualitative insight into the properties that are transferred from the teacher to the student. Thus, HARD represents a generally applicable, dynamically optimized data augmentation technique tailored to improve the generalization and convergence speed of models trained with KD.
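The core idea, training the student on augmented inputs where it disagrees most with the teacher, can be illustrated with a minimal sketch. This is not the implementation described above: the names `teacher`, `LinearStudent`, `hardest_shift`, and `hard_distill` are hypothetical, the models are one-dimensional toys, and a search over a fixed list of candidate shifts stands in (as an assumption) for a learned augmentation model.

```python
import math


def teacher(x):
    # Hypothetical teacher: any fixed function works for this illustration.
    return math.sin(x)


class LinearStudent:
    """Hypothetical student y = w * x + b, updated by plain gradient descent."""

    def __init__(self, w=0.0, b=0.0):
        self.w = w
        self.b = b

    def __call__(self, x):
        return self.w * x + self.b

    def distill_step(self, x, lr=0.01):
        # One gradient step on the squared error to the teacher output at x.
        err = self(x) - teacher(x)
        self.w -= lr * 2.0 * err * x
        self.b -= lr * 2.0 * err


def disagreement(student, x):
    # Squared difference between student and teacher outputs.
    return (student(x) - teacher(x)) ** 2


def hardest_shift(student, x, shifts):
    # Stand-in for a learned augmentor (assumption): search candidate shifts
    # and return the one where student and teacher disagree most.
    return max(shifts, key=lambda s: disagreement(student, x + s))


def hard_distill(student, inputs, shifts, epochs=1, lr=0.01):
    # Alternate between picking the hardest augmentation for the current
    # student and distilling on the resulting augmented input.
    for _ in range(epochs):
        for x in inputs:
            s = hardest_shift(student, x, shifts)
            student.distill_step(x + s, lr=lr)
    return student
```

In this sketch the augmentation is re-selected at every step, so the hard examples track the student as it changes; the selected shifts can also be logged to inspect where the student still deviates from the teacher.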