In this work, we study how the generalization performance of a given translation direction changes with its sampling ratio in Multilingual Neural Machine Translation (MNMT). By training over 200 multilingual models that vary in model size, translation directions, and total number of tasks, we find that scalarization leads to a multitask trade-off front that deviates from the traditional Pareto front when the training corpus is imbalanced. That is, the performance of certain translation directions does not improve as their weight in the multi-task optimization objective increases, which makes it difficult to improve the overall performance across all directions. Based on these observations, we propose the Double Power Law to predict the unique performance trade-off front in MNMT; it is robust across languages, levels of data adequacy, and numbers of tasks. Finally, we formulate sample ratio selection in MNMT as an optimization problem based on the Double Power Law, which achieves better performance than temperature searching and gradient manipulation methods while using up to half of the total training budget in our experiments.
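For readers unfamiliar with the baselines named in the abstract, the following is a minimal sketch of temperature-based sampling and of a scalarized multi-task loss. It is not the authors' code and does not implement the Double Power Law, whose functional form is not given here. The function names, the direction labels (`en-fr`, `en-gu`), the corpus sizes, and the loss values are all hypothetical and chosen only for illustration.

```python
from typing import Dict


def temperature_sampling_ratios(sizes: Dict[str, int], temperature: float) -> Dict[str, float]:
    """Return sampling ratios p_i proportional to (D_i / sum_j D_j) ** (1 / T).

    T = 1 reproduces the raw data proportions; larger T flattens the
    distribution toward uniform, upsampling low-resource directions.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(sizes.values())
    scaled = {d: (n / total) ** (1.0 / temperature) for d, n in sizes.items()}
    norm = sum(scaled.values())
    return {d: s / norm for d, s in scaled.items()}


def scalarized_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-direction losses (scalarization)."""
    return sum(weights[d] * losses[d] for d in losses)


# Hypothetical imbalanced corpus: one high-resource and one low-resource direction.
sizes = {"en-fr": 9_000_000, "en-gu": 1_000_000}

proportional = temperature_sampling_ratios(sizes, temperature=1.0)
flattened = temperature_sampling_ratios(sizes, temperature=5.0)

# Hypothetical per-direction losses, combined with the proportional weights.
example_loss = scalarized_loss({"en-fr": 2.0, "en-gu": 4.0}, proportional)
```

Temperature search sweeps the single scalar `temperature` and retrains for each value, which is the cost that a predictive law for the trade-off front is meant to reduce. In sampling-based training, the sampling ratio of a direction plays the role of its weight in the scalarized objective in expectation, which is why the abstract treats the two interchangeably.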