In this work, we study how the performance of a given translation direction changes with its sampling ratio in Multilingual Neural Machine Translation (MNMT). By training over 200 multilingual models with various model sizes, data sizes, and language directions, we find that the performance of a given translation direction does not always improve as its weight in the multi-task optimization objective increases. Accordingly, when the training corpus is imbalanced, the scalarization method leads to a multi-task trade-off front that deviates from the traditional Pareto front, which poses a great challenge for improving the overall performance of all directions. Based on our observations, we propose the Double Power Law to predict the unique performance trade-off front in MNMT, which is robust across various languages, levels of data adequacy, and numbers of tasks. Finally, we formulate the sampling ratio selection problem in MNMT as an optimization problem based on the Double Power Law. In our experiments, it achieves better performance than temperature searching and gradient manipulation methods with only 1/5 to 1/2 of the total training budget. We release the code at https://github.com/pkunlp-icler/ParetoMNMT for reproduction.
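To make the notions of sampling ratio and scalarization concrete, the following is a minimal sketch of the widely used temperature-based sampling heuristic, which corresponds to the temperature searching baseline mentioned above. It is not the Double Power Law and is not taken from the released code; the function names, the example directions, and the corpus sizes are hypothetical and chosen only for illustration.

```python
def temperature_sampling_ratios(sizes, temperature=1.0):
    """Return a sampling ratio per direction, proportional to (n_i / N) ** (1 / T).

    sizes: dict mapping a direction name to its number of training examples.
    temperature: T = 1 keeps the data proportions; larger T flattens them
    toward a uniform distribution, which upsamples low-resource directions.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(sizes.values())
    unnormalized = {
        direction: (count / total) ** (1.0 / temperature)
        for direction, count in sizes.items()
    }
    normalizer = sum(unnormalized.values())
    return {direction: w / normalizer for direction, w in unnormalized.items()}


def scalarized_loss(losses, ratios):
    """Weighted sum of per-direction losses, with the sampling ratios as weights."""
    return sum(ratios[direction] * losses[direction] for direction in losses)


# Hypothetical imbalanced corpus: one high-resource and one low-resource direction.
corpus_sizes = {"en-fr": 9000, "en-tr": 1000}
proportional = temperature_sampling_ratios(corpus_sizes, temperature=1.0)
flattened = temperature_sampling_ratios(corpus_sizes, temperature=2.0)
```

With these assumed sizes, raising the temperature from 1 to 2 shifts weight from the high-resource direction to the low-resource one. Temperature searching typically trains one model per candidate temperature, which is the cost that a predictive law for the trade-off front aims to reduce.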