The advent of scalable deep models and large datasets has improved the performance of Neural Machine Translation (NMT). Knowledge Distillation (KD) improves efficiency by transferring knowledge from a teacher model to a more compact student model. However, KD approaches for the Transformer architecture often rely on heuristics, particularly when deciding which teacher layers to distill from. In this paper, we introduce the 'Align-to-Distill' (A2D) strategy, which addresses the feature mapping problem by adaptively aligning student attention heads with their teacher counterparts during training. The Attention Alignment Module in A2D performs a dense head-by-head comparison between student and teacher attention heads across layers, turning the combinatorial choice of mapping heuristics into a learning problem. Our experiments demonstrate the efficacy of A2D, with gains of up to +3.61 and +0.63 BLEU points for WMT-2022 De->Dsb and WMT-2014 En->De, respectively, compared to Transformer baselines.
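To make the idea of a learned, dense head-by-head alignment concrete, the following is a minimal NumPy sketch and not the A2D implementation itself. It assumes that attention maps from all student layers and heads are stacked into one array, that each teacher head is matched against a learnable convex combination of student heads, and that the mismatch is measured with a KL divergence. The names `align_heads`, `alignment_loss`, and `mix_logits`, as well as the choice of a softmax-normalized mixing matrix, are illustrative assumptions rather than details taken from the abstract.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def align_heads(student_attn, mix_logits):
    """Mix student heads into one aligned map per teacher head.

    student_attn: (S, T, T) attention maps, S = student layers * heads,
                  each row a probability distribution over keys.
    mix_logits:   (H, S) learnable scores, H = teacher layers * heads
                  (hypothetical parameterization, assumed for this sketch).
    Returns an (H, T, T) array whose rows are still distributions, because
    a convex combination of distributions is itself a distribution.
    """
    weights = softmax(mix_logits, axis=-1)
    return np.einsum("hs,sqk->hqk", weights, student_attn)


def alignment_loss(student_attn, teacher_attn, mix_logits, eps=1e-9):
    """Mean KL(teacher || aligned student) over teacher heads and query rows."""
    aligned = align_heads(student_attn, mix_logits)
    kl = teacher_attn * (np.log(teacher_attn + eps) - np.log(aligned + eps))
    return kl.sum(axis=-1).mean()


# Toy forward pass: 6 student heads in total, 12 teacher heads, 5 tokens.
rng = np.random.default_rng(0)
student = softmax(rng.normal(size=(6, 5, 5)))
teacher = softmax(rng.normal(size=(12, 5, 5)))
mix_logits = np.zeros((12, 6))  # uniform mixing before any training
loss = alignment_loss(student, teacher, mix_logits)
```

In an actual training setup, `mix_logits` would be optimized jointly with the student in an autodiff framework, so that which student heads imitate which teacher heads is learned rather than fixed by a hand-picked layer mapping. This sketch shows only the forward computation.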