Knowledge Distillation (KD) is a predominant approach for BERT compression. Previous KD-based methods focus on designing extra alignment losses for the student model to mimic the behavior of the teacher model, transferring knowledge only indirectly. In this paper, we propose a novel Weight-Inherited Distillation (WID), which transfers knowledge directly from the teacher. WID requires no additional alignment loss and trains a compact student by inheriting the teacher's weights, offering a new perspective on knowledge distillation. Specifically, we design row compactors and column compactors as mappings and then compress the weights via structural re-parameterization. Experimental results on the GLUE and SQuAD benchmarks show that WID outperforms previous state-of-the-art KD-based baselines. Further analysis indicates that WID can also learn the teacher's attention patterns without any alignment loss on attention distributions. The code is available at https://github.com/wutaiqiang/WID-NAACL2024.
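The core idea of compressing weights with compactors can be sketched as follows. This is a minimal NumPy illustration, not the authors' exact formulation: the dimensions, variable names, and random initialization are hypothetical, and the real method trains the compactors before merging them.

```python
import numpy as np

# Hypothetical sizes: a teacher weight of shape (768, 768)
# compressed to a student weight of shape (512, 512).
d_t_out, d_t_in = 768, 768
d_s_out, d_s_in = 512, 512

rng = np.random.default_rng(0)
W_teacher = rng.standard_normal((d_t_out, d_t_in))

# Row and column compactors act as linear mappings on the
# weight's two axes (output and input dimensions).
row_compactor = rng.standard_normal((d_s_out, d_t_out))
col_compactor = rng.standard_normal((d_t_in, d_s_in))

# Structural re-parameterization: during training the compactors
# sit alongside the teacher weight; afterwards the product is
# merged into a single compact student weight, so no extra
# parameters remain at inference time.
W_student = row_compactor @ W_teacher @ col_compactor
assert W_student.shape == (d_s_out, d_s_in)
```

Merging the three matrices into one is what makes the compression "structural": the student ends up with a genuinely smaller weight matrix rather than a masked or factored copy of the teacher's.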