Multilingual neural machine translation has made remarkable progress in recent years. However, the long-tailed distribution of multilingual corpora poses a Pareto optimization challenge: optimizing for some languages may come at the cost of degrading performance on others. Existing balancing training strategies are equivalent to a series of Pareto optimal solutions that trade off along a Pareto frontier. In this work, we propose a new training framework, Pareto Mutual Distillation (Pareto-MD), which aims to push the Pareto frontier outwards rather than make trade-offs along it. Specifically, Pareto-MD collaboratively trains two Pareto optimal solutions that favor different languages and lets each learn from the other's strengths via knowledge distillation. Furthermore, we introduce a novel strategy that enables stronger communication between the Pareto optimal solutions and broadens the applicability of our approach. Experimental results on the widely used WMT and TED datasets show that our method significantly pushes the Pareto frontier and outperforms baselines by up to +2.46 BLEU.
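The idea of two solutions that favor different languages and distill from each other can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not a detail stated above: temperature-based sampling stands in for the balancing strategy, the per-token loss mixes cross-entropy with a KL term toward the peer model, and the names `sampling_probs`, `distillation_loss`, and `alpha` are hypothetical.

```python
import math


def sampling_probs(sizes, temperature):
    """Temperature-based sampling: p_i is proportional to size_i ** (1 / temperature).

    Assumption: this is used only as an example of a balancing strategy.
    temperature = 1 follows the raw data distribution (favors high-resource
    languages); a larger temperature flattens it (favors low-resource ones).
    """
    weights = [s ** (1.0 / temperature) for s in sizes]
    total = sum(weights)
    return [w / total for w in weights]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, target, alpha=0.5):
    """Hypothetical per-token loss: (1 - alpha) * cross-entropy + alpha * KL(teacher || student).

    The teacher distribution is treated as a constant; in mutual
    distillation each model plays the teacher role for the other.
    """
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    cross_entropy = -math.log(student[target])
    return (1 - alpha) * cross_entropy + alpha * kl_divergence(teacher, student)


# Two sampling distributions over a high-resource and a low-resource language.
corpus_sizes = [1_000_000, 10_000]
probs_a = sampling_probs(corpus_sizes, temperature=1.0)  # model A: favors high-resource
probs_b = sampling_probs(corpus_sizes, temperature=5.0)  # model B: favors low-resource

# One token position: each model is trained against the other's prediction.
logits_a, logits_b, gold = [2.0, 0.5, 0.1], [1.2, 1.0, 0.3], 0
loss_a = distillation_loss(logits_a, logits_b, gold)  # A learns from B
loss_b = distillation_loss(logits_b, logits_a, gold)  # B learns from A
```

In this sketch, `alpha` controls how strongly each model follows its peer. A real system would compute the same quantities over batches of token distributions in a deep learning framework.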