Knowledge distillation has recently attracted a great deal of interest as a way to compress pre-trained language models. However, existing knowledge distillation methods suffer from two limitations. First, the student model simply imitates the teacher's behavior while ignoring the reasoning behind it. Second, these methods usually focus on transferring sophisticated model-specific knowledge but overlook data-specific knowledge. In this paper, we present a novel attribution-driven knowledge distillation approach, which uses Integrated Gradients (IG) to explore the token-level rationale behind the teacher model's predictions and transfers this attribution knowledge to the student model. To enhance the transfer of model reasoning and generalization, we further explore multi-view attribution distillation on all potential decisions of the teacher. We conduct comprehensive experiments with BERT on the GLUE benchmark. The experimental results demonstrate that our approach outperforms several state-of-the-art methods.
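To make the idea of token-level attribution transfer concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes a toy classifier (mean-pooled token embeddings followed by a linear layer and softmax) in place of BERT, an all-zero baseline for IG, a midpoint Riemann approximation of the path integral, and a squared-error loss between L2-normalized attribution maps averaged over all classes as a stand-in for the multi-view objective. All function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def class_prob(x, W, c):
    """Toy classifier (assumption): mean-pool embeddings x (n, d), project with W (d, C)."""
    return softmax(x.mean(axis=0) @ W)[c]

def grad_class_prob(x, W, c):
    """Analytic gradient of class_prob with respect to x for the toy model."""
    p = softmax(x.mean(axis=0) @ W)
    dz = -p[c] * p
    dz[c] += p[c]                    # d p_c / d logits = p_c * (onehot_c - p)
    g_pooled = W @ dz                # gradient w.r.t. the pooled vector, shape (d,)
    return np.tile(g_pooled / x.shape[0], (x.shape[0], 1))

def integrated_gradients(x, baseline, W, c, steps=64):
    """Token-level IG for class c, approximated with a midpoint Riemann sum."""
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.zeros_like(x)
    for a in alphas:
        avg_grad += grad_class_prob(baseline + a * (x - baseline), W, c)
    avg_grad /= steps
    # Sum over the embedding dimension to get one score per token.
    return ((x - baseline) * avg_grad).sum(axis=1)

def multi_view_attribution_loss(x_t, W_t, x_s, W_s, steps=64):
    """Squared error between normalized teacher and student maps, averaged over classes."""
    n_classes = W_t.shape[1]
    loss = 0.0
    for c in range(n_classes):       # one "view" per possible decision
        a_t = integrated_gradients(x_t, np.zeros_like(x_t), W_t, c, steps)
        a_s = integrated_gradients(x_s, np.zeros_like(x_s), W_s, c, steps)
        a_t = a_t / (np.linalg.norm(a_t) + 1e-12)
        a_s = a_s / (np.linalg.norm(a_s) + 1e-12)
        loss += np.sum((a_t - a_s) ** 2)
    return loss / n_classes
```

Because attributions are reduced to one score per token, the teacher and student in this sketch may use different embedding widths as long as they share the token sequence. In practice the gradients would come from automatic differentiation through the actual teacher and student networks rather than from a closed-form expression.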