Neural Architecture Search for Effective Teacher-Student Knowledge Transfer in Language Models

Large pre-trained language models have achieved state-of-the-art results on a variety of downstream tasks. Knowledge Distillation (KD) into a smaller student model addresses their inefficiency, allowing for deployment in resource-constrained environments. KD, however, remains ineffective, as the student is manually selected from a set of existing options already pre-trained on large corpora, a sub-optimal choice within the space of all possible student architectures. This paper proposes KD-NAS, the use of Neural Architecture Search (NAS) guided by the Knowledge Distillation process to find the optimal student model for distillation from a teacher, for a given natural language task. In each episode of the search process, a NAS controller predicts a reward based on a combination of accuracy on the downstream task and latency of inference. The top candidate architectures are then distilled from the teacher on a small proxy set. Finally, the architecture(s) with the highest reward are selected and distilled on the full downstream task training set. When distilling on the MNLI task, our KD-NAS model achieves a 2-point improvement in accuracy on GLUE tasks, at equivalent GPU latency, over a hand-crafted student architecture available in the literature. Using Knowledge Distillation, this model also achieves a 1.4x speedup in GPU latency (3.2x speedup on CPU) with respect to a BERT-Base teacher, while maintaining 97% of its performance on GLUE tasks (without CoLA). We also obtain an architecture with performance equivalent to that of the hand-crafted student model on the GLUE benchmark, but with a 15% speedup in GPU latency (20% speedup in CPU latency) and 0.8 times the number of parameters.
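The search episode described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the reward formula, so the weighted form below, the weight `alpha`, the `Candidate` fields, and the `predict_reward` and `distill_on_proxy` callables are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical description of a student architecture."""
    num_layers: int
    hidden_size: int


def reward(accuracy: float, latency: float, teacher_latency: float,
           alpha: float = 0.5) -> float:
    """Assumed reward: weighted mix of accuracy and latency savings.

    The exact combination used by KD-NAS is not stated in the abstract.
    """
    latency_score = 1.0 - latency / teacher_latency
    return (1.0 - alpha) * accuracy + alpha * latency_score


def select_student(
    candidates: Iterable[Candidate],
    predict_reward: Callable[[Candidate], float],
    distill_on_proxy: Callable[[Candidate], Tuple[float, float]],
    teacher_latency: float,
    top_k: int = 2,
    alpha: float = 0.5,
) -> Tuple[Candidate, float]:
    """Run one simplified search episode and return the best candidate."""
    # 1. The controller ranks candidates by predicted reward.
    ranked = sorted(candidates, key=predict_reward, reverse=True)
    # 2. Only the top candidates are distilled on the small proxy set.
    scored = []
    for cand in ranked[:top_k]:
        accuracy, latency = distill_on_proxy(cand)
        scored.append((reward(accuracy, latency, teacher_latency, alpha), cand))
    # 3. The candidate with the highest measured reward is kept; it would
    #    then be distilled on the full training set (not shown here).
    best_reward, best = max(scored, key=lambda pair: pair[0])
    return best, best_reward
```

In this sketch the controller is any callable that scores a candidate, so a learned predictor can be swapped in for a stub without changing the selection logic.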
