Conventional knowledge distillation (KD) methods require access to the teacher's internal information, e.g., logits. However, such information is not always accessible for large pre-trained language models (PLMs). In this work, we focus on decision-based KD for PLMs, where only the teacher's decisions (i.e., top-1 labels) are accessible. To bridge the information gap between logits and decisions, we propose a novel method that estimates logits from decision distributions. Specifically, a decision distribution can be derived theoretically as a function of the logits and estimated empirically with test-time data augmentation. By combining the theoretical and empirical estimates of the decision distribution, the estimation of logits reduces to a simple root-finding problem. Extensive experiments show that our method significantly outperforms strong baselines on both natural language understanding and machine reading comprehension datasets.
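To make the root-finding idea concrete, the following is a minimal sketch rather than the method as implemented in this work. It assumes a binary task and assumes that test-time augmentation behaves like zero-mean Gaussian noise of known scale `sigma` added to the teacher's logit gap (the difference between the two logits). Under that assumption the probability of the teacher deciding class 1 is a normal CDF of the gap, and matching it to the observed decision frequency gives a one-dimensional equation. All function names, the noise model, and the simulated teacher are hypothetical.

```python
import random
from statistics import NormalDist


def theoretical_decision_prob(gap, sigma=1.0):
    """P(teacher decides class 1) if the logit gap is perturbed by N(0, sigma^2).

    Assumed noise model, chosen only for illustration.
    """
    return NormalDist(0.0, sigma).cdf(gap)


def empirical_decision_prob(decisions):
    """Fraction of augmented inputs on which the teacher returned class 1."""
    return sum(decisions) / len(decisions)


def estimate_logit_gap(p_hat, sigma=1.0, tol=1e-8):
    """Solve theoretical_decision_prob(gap) = p_hat for gap by bisection."""
    # Clip so the root stays finite when every decision agrees.
    eps = 1e-6
    p_hat = min(max(p_hat, eps), 1.0 - eps)
    lo, hi = -10.0 * sigma, 10.0 * sigma
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # The CDF is increasing in the gap, so the root lies above mid
        # whenever the theoretical probability is still too small.
        if theoretical_decision_prob(mid, sigma) < p_hat:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate_teacher_decisions(true_gap, sigma=1.0, n=20000, seed=0):
    """Hypothetical black-box teacher: returns only top-1 labels (0 or 1)."""
    rng = random.Random(seed)
    return [1 if true_gap + rng.gauss(0.0, sigma) > 0 else 0 for _ in range(n)]


if __name__ == "__main__":
    decisions = simulate_teacher_decisions(true_gap=0.8)
    p_hat = empirical_decision_prob(decisions)
    print("empirical decision probability:", p_hat)
    print("estimated logit gap:", estimate_logit_gap(p_hat))
```

Bisection is used here only because the assumed decision probability is monotone in the gap; any bracketing root finder would serve. The recovered gap could then act as a soft target for the student in place of the inaccessible teacher logits.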