Knowledge distillation, a technique for model compression and performance enhancement, has gained significant traction in Neural Machine Translation (NMT). However, existing research is largely empirical, and it remains unclear how student model capacity, data complexity, and decoding strategy jointly influence distillation effectiveness. To address this gap, we investigate these factors and their interplay in word-level and sequence-level distillation for NMT. Through extensive experiments on datasets including IWSLT13 En$\rightarrow$Fr and IWSLT14 En$\rightarrow$De, we empirically validate hypotheses about how each factor affects knowledge distillation. Our study shows that model capacity, data complexity, and decoding strategy each significantly influence distillation effectiveness, and it introduces a novel, optimized distillation approach. Applied to the IWSLT14 de$\rightarrow$en translation task, this approach achieves state-of-the-art performance, demonstrating its practical value for NMT.