Knowledge distillation (KD) has proven to be a successful strategy for improving the performance of smaller models on many NLP tasks. However, most work on KD explores only monolingual scenarios. In this paper, we investigate the value of KD in multilingual settings. We assess the respective contributions of KD and model initialization by analyzing how well the student model acquires multilingual knowledge from the teacher model. Our proposed method emphasizes directly copying the teacher model's weights to the student model to improve initialization. Our findings show that, across various multilingual settings, initializing the student with weights copied from the fine-tuned teacher contributes more than the distillation process itself. Furthermore, we demonstrate that efficient weight initialization preserves multilingual capabilities even in low-resource scenarios.
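To make the copy-weight initialization concrete, the following is a minimal sketch assuming a BERT-style multilingual teacher fine-tuned with HuggingFace Transformers; the checkpoint path, the `.bert` attribute, and the every-other-layer selection heuristic are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch: initialize a smaller student by copying weights from a
# fine-tuned multilingual teacher. The checkpoint path and the layer-selection
# heuristic (every other layer) are assumptions for illustration only.
from transformers import AutoConfig, AutoModelForSequenceClassification

# Hypothetical fine-tuned multilingual teacher (e.g., an mBERT checkpoint).
teacher = AutoModelForSequenceClassification.from_pretrained(
    "path/to/fine-tuned-multilingual-teacher"
)

# Build a student config with half the teacher's depth.
student_config = AutoConfig.from_pretrained("path/to/fine-tuned-multilingual-teacher")
student_config.num_hidden_layers = teacher.config.num_hidden_layers // 2
student = AutoModelForSequenceClassification.from_config(student_config)

# Copy embeddings and the classification head directly from the teacher
# (assumes a BERT-style architecture exposing a `.bert` backbone).
student.bert.embeddings.load_state_dict(teacher.bert.embeddings.state_dict())
student.classifier.load_state_dict(teacher.classifier.state_dict())

# Copy every other teacher layer into the student (an assumed heuristic).
for i, layer in enumerate(student.bert.encoder.layer):
    layer.load_state_dict(teacher.bert.encoder.layer[2 * i].state_dict())
```

The student can then be fine-tuned or distilled as usual; in this sketch, only the initialization step differs from training the student from scratch.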