Knowledge distillation addresses the problem of transferring knowledge from a teacher model to a student model. Typically, multiple types of knowledge can be extracted from the teacher model, and the challenge is to make full use of them when training the student model. Our preliminary study shows that (1) not all of this knowledge is necessary for learning a good student model, and (2) knowledge distillation benefits from specific kinds of knowledge at different training steps. In response to these findings, we propose an actor-critic approach that selects appropriate knowledge to transfer during knowledge distillation. In addition, we refine the training algorithm to ease the computational burden. Experimental results on the GLUE datasets show that our method significantly outperforms several strong knowledge distillation baselines.
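To make the idea of selecting knowledge at each training step concrete, the following is a minimal sketch of a distillation loss that sums only the knowledge types chosen by a binary selection mask. It is an illustration under stated assumptions, not the method described above: the two knowledge types (`soft_labels`, `hidden_states`), the function names, and the hard-coded mask are all hypothetical. In an actor-critic setup the mask would presumably come from the actor's policy, which is not modeled here.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def selected_distillation_loss(knowledge_losses, selection):
    """Sum the per-knowledge losses whose mask entry is 1.

    knowledge_losses: dict mapping a (hypothetical) knowledge name to its loss.
    selection: dict mapping the same names to 0 or 1. Hard-coded by the caller
    in this sketch; assumed to be produced by a learned policy in practice.
    """
    if set(knowledge_losses) != set(selection):
        raise ValueError("selection must cover exactly the same knowledge types")
    return sum(loss for name, loss in knowledge_losses.items() if selection[name])


if __name__ == "__main__":
    # Toy teacher and student outputs (made-up numbers for illustration only).
    teacher_logits, student_logits = [2.0, 0.5, -1.0], [1.5, 0.7, -0.5]
    teacher_hidden, student_hidden = [0.2, -0.1, 0.4], [0.1, 0.0, 0.3]

    losses = {
        "soft_labels": kl_divergence(
            softmax(teacher_logits, temperature=2.0),
            softmax(student_logits, temperature=2.0),
        ),
        "hidden_states": mse(teacher_hidden, student_hidden),
    }

    # At this step only the soft-label knowledge is transferred.
    step_selection = {"soft_labels": 1, "hidden_states": 0}
    print(selected_distillation_loss(losses, step_selection))
```

Using a hard 0/1 mask rather than continuous weights mirrors the observation that not all knowledge is necessary: an unselected knowledge type contributes nothing to the student's update at that step.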