Knowledge distillation is an effective technique for compressing pre-trained language models. However, existing methods focus only on how knowledge is distributed across layers, which can cause fine-grained information to be lost during alignment. To address this issue, we introduce Multi-aspect Knowledge Distillation (MaKD), a method that mimics the self-attention and feed-forward modules in greater depth to capture rich linguistic knowledge from different aspects. Experimental results show that MaKD achieves competitive performance against various strong baselines under the same storage parameter budget. Our method also performs well when distilling auto-regressive models.
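To make the idea of mimicking sub-modules rather than whole layers concrete, the following is a minimal sketch of a module-level distillation loss. It is not the MaKD objective, which the abstract does not specify. It assumes that the student and teacher activations of each sub-module have already been brought to the same shape (for example by a projection), and it uses a plain mean squared error as the matching criterion. The names `mse`, `module_distill_loss`, and the keys `"attention"` and `"ffn"` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional

# Activations of one sub-module, laid out as [tokens][hidden].
Matrix = List[List[float]]


def mse(student: Matrix, teacher: Matrix) -> float:
    """Mean squared error between two equally shaped activation matrices."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher differ in number of rows")
    total, count = 0.0, 0
    for s_row, t_row in zip(student, teacher):
        if len(s_row) != len(t_row):
            raise ValueError("student and teacher rows differ in width")
        for s, t in zip(s_row, t_row):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("activation matrices are empty")
    return total / count


def module_distill_loss(
    student_acts: Dict[str, Matrix],
    teacher_acts: Dict[str, Matrix],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of per-module MSE terms.

    Assumption: both dicts use the same (hypothetical) module names as keys,
    e.g. "attention" and "ffn". A module missing from `weights` gets weight 1.
    """
    if student_acts.keys() != teacher_acts.keys():
        raise ValueError("student and teacher expose different modules")
    weights = weights or {}
    return sum(
        weights.get(name, 1.0) * mse(student_acts[name], teacher_acts[name])
        for name in student_acts
    )


# Toy usage: one token, hidden size 2, for each sub-module.
student = {"attention": [[1.0, 2.0]], "ffn": [[0.0, 0.0]]}
teacher = {"attention": [[1.0, 4.0]], "ffn": [[1.0, 1.0]]}
loss = module_distill_loss(student, teacher, {"attention": 0.5, "ffn": 1.0})
```

In a real setup such a term would typically be computed on tensors from a deep learning framework and added to the task or logit-matching loss. The per-module weights are one simple way to balance the attention and feed-forward terms.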