Knowledge distillation, which transfers knowledge from a teacher model to a student model, has emerged as a powerful technique in neural machine translation for compressing models or simplifying training targets. It encompasses two primary methods: sentence-level distillation and token-level distillation. In sentence-level distillation, the student model is trained to match the output of the teacher model, which can ease training and give the student model a comprehensive understanding of global structure. In contrast, token-level distillation requires the student model to learn the output distribution of the teacher model, enabling a more fine-grained transfer of knowledge. Studies have reported divergent performance between sentence-level and token-level distillation across different scenarios, leading to confusion over which method to choose in practice. In this study, we argue that token-level distillation, with its more complex objective (i.e., a distribution), is better suited to ``simple'' scenarios, while sentence-level distillation excels in ``complex'' scenarios. To substantiate our hypothesis, we systematically analyze the performance of the distillation methods by varying the size of the student model, the complexity of the text, and the difficulty of the decoding procedure. Although our experimental results validate the hypothesis, defining the complexity level of a given scenario remains challenging. We therefore introduce a novel hybrid method that combines token-level and sentence-level distillation through a gating mechanism, aiming to leverage the advantages of both. Experiments demonstrate that the hybrid method outperforms token-level distillation, sentence-level distillation, and previous work.
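To make the two objectives concrete, the following is a minimal sketch at a single target position. It rests on assumptions that the abstract does not confirm: the token-level loss is written as the cross-entropy between the teacher's and the student's output distributions, the sentence-level loss as the negative log-likelihood of the token the teacher actually decoded, and the gate as a fixed scalar in [0, 1]. The function names and the scalar `gate` are hypothetical; the gating mechanism of the proposed hybrid method may be learned, per-token, or otherwise different.

```python
import math


def token_level_loss(teacher_probs, student_probs, eps=1e-12):
    """Cross-entropy between teacher and student distributions at one position."""
    return -sum(p * math.log(q + eps) for p, q in zip(teacher_probs, student_probs))


def sentence_level_loss(teacher_token, student_probs, eps=1e-12):
    """Negative log-likelihood of the token the teacher generated at this position."""
    return -math.log(student_probs[teacher_token] + eps)


def hybrid_loss(teacher_probs, teacher_token, student_probs, gate):
    """Convex mix of the two losses. `gate` is a hypothetical scalar in [0, 1]:
    gate = 1 recovers the token-level loss, gate = 0 the sentence-level loss."""
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    tok = token_level_loss(teacher_probs, student_probs)
    sent = sentence_level_loss(teacher_token, student_probs)
    return gate * tok + (1.0 - gate) * sent


# Toy example over a 3-word vocabulary (all values are illustrative only).
teacher_probs = [0.7, 0.2, 0.1]   # teacher's output distribution
student_probs = [0.5, 0.3, 0.2]   # student's output distribution
teacher_token = 0                 # index of the token in the teacher's decoded output

for gate in (0.0, 0.5, 1.0):
    print(gate, round(hybrid_loss(teacher_probs, teacher_token, student_probs, gate), 4))
```

When the teacher's distribution is one-hot on its decoded token, the two losses coincide in this sketch. They differ only in how much of the teacher's remaining probability mass the student is asked to reproduce, which is the sense in which the token-level objective is the more complex one.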