Knowledge distillation compresses a larger neural network, the teacher, into a smaller one, the student, while preserving as much of the teacher's performance as possible. Existing knowledge distillation methods mostly apply to classification tasks, and many of them require access to the data used to train the teacher model. To address knowledge distillation for regression tasks in the absence of the original training data, previous work has proposed a data-free method in which synthetic data are generated by a generator model trained adversarially against the student model. These synthetic data, together with the labels the teacher model predicts for them, are then used to train the student model. In this study, we investigate the behavior of various synthetic data generation methods and propose a new synthetic data generation strategy that directly optimizes for a large but bounded difference between the student and teacher models. Our results on benchmark and case study experiments demonstrate that the proposed strategy allows the student model to learn better and to emulate the performance of the teacher model more closely.
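To make the idea of a "large but bounded difference" concrete, the following toy sketch may help. It is not the method of this study: the abstract does not specify the objective, the generator, or the architectures, so every name here (`teacher`, `make_student`, `bounded_gap`, `generate_sample`, `distill`, the cap value, and the input range) is a hypothetical stand-in. In particular, the sketch swaps the generator model for a simple random hill-climb over a scalar input, and it assumes the bound is implemented by capping the absolute teacher-student gap.

```python
import math
import random


def teacher(x):
    # Hypothetical stand-in for a trained teacher regressor.
    return math.sin(3.0 * x) + 0.5 * x


def make_student(w, b):
    # A deliberately small student: a single linear unit.
    return lambda x: w * x + b


def bounded_gap(x, student, cap):
    # Teacher-student disagreement, capped so it cannot grow without limit
    # (assumed form of the "large but bounded" objective).
    return min(abs(teacher(x) - student(x)), cap)


def generate_sample(student, cap, lo=-2.0, hi=2.0, steps=50,
                    step_size=0.1, rng=random):
    # Random hill-climb on the capped gap, keeping x inside [lo, hi].
    # Ties are accepted, so the search can wander once the cap is reached.
    x = rng.uniform(lo, hi)
    best = bounded_gap(x, student, cap)
    for _ in range(steps):
        cand = min(hi, max(lo, x + rng.gauss(0.0, step_size)))
        score = bounded_gap(cand, student, cap)
        if score >= best:
            x, best = cand, score
    return x


def distill(rounds=100, batch=8, lr=0.05, cap=1.0, seed=0):
    # Alternate between generating synthetic inputs and fitting the student
    # to the teacher's predictions on them (mean squared error, plain SGD).
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(rounds):
        student = make_student(w, b)
        xs = [generate_sample(student, cap, rng=rng) for _ in range(batch)]
        grad_w = grad_b = 0.0
        for x in xs:
            err = student(x) - teacher(x)  # teacher output acts as the label
            grad_w += 2.0 * err * x / batch
            grad_b += 2.0 * err / batch
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    w, b = distill()
    print("student parameters:", w, b)
```

The cap is the one design choice worth noting: without it, a sample search that only maximizes disagreement is free to chase inputs where the gap is arbitrarily large, whereas a capped objective stops rewarding the search once the disagreement is "large enough".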