Knowledge distillation (KD) is a promising technique for model compression in neural machine translation. However, where the knowledge in KD resides is still unclear, which may hinder the development of KD. In this work, we first unravel this mystery from an empirical perspective and show that the knowledge comes from the teacher's top-1 predictions, a finding that also helps us build a potential connection between word- and sequence-level KD. Building on this finding, we point out two inherent issues in vanilla word-level KD. First, the current KD objective spreads its focus over the whole output distribution to learn the knowledge, yet gives no special treatment to the most crucial top-1 information. Second, the knowledge is largely covered by the ground-truth information, since most of the teacher's top-1 predictions overlap with ground-truth tokens, which further restricts the potential of KD. To address these issues, we propose a novel method named \textbf{T}op-1 \textbf{I}nformation \textbf{E}nhanced \textbf{K}nowledge \textbf{D}istillation (TIE-KD). Specifically, we design a hierarchical ranking loss to enforce learning of the top-1 information from the teacher. Additionally, we develop an iterative KD procedure that infuses additional knowledge by distilling on data without ground-truth targets. Experiments on WMT'14 English-German, WMT'14 English-French and WMT'16 English-Romanian demonstrate that our method boosts Transformer$_{base}$ students by +1.04, +0.60 and +1.11 BLEU, respectively, and significantly outperforms the vanilla word-level KD baseline. Moreover, our method generalizes better across different teacher-student capacity gaps than existing KD techniques.
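To make the contrast concrete, the sketch below illustrates, at a single decoding step, the vanilla word-level KD objective (KL divergence between teacher and student distributions over the vocabulary) alongside a hinge-style top-1 ranking term in the spirit of the proposed enhancement. This is a minimal illustration, not the paper's actual hierarchical ranking loss: the function names, the single-margin hinge form, and the `margin` value are assumptions chosen for clarity.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def word_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Vanilla word-level KD at one decoding step:
    KL(teacher || student) over the full vocabulary distribution.
    Note the loss is spread across every vocabulary entry."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))


def top1_margin_loss(student_logits, teacher_logits, margin=0.1):
    """Hypothetical ranking-style term (illustrative, not the paper's exact loss):
    push the student's probability of the teacher's top-1 token above the
    student's own runner-up probability by at least `margin` (hinge loss)."""
    q = softmax(student_logits)
    top1 = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    p_top1 = q[top1]
    runner_up = max(qi for i, qi in enumerate(q) if i != top1)
    return max(0.0, margin - (p_top1 - runner_up))
```

A training objective in this spirit would combine both terms, e.g. `word_kd_loss(...) + lam * top1_margin_loss(...)`, so the student is penalized extra whenever the teacher's top-1 token is not also the student's clear top-1 choice.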