In this work, we provide a large-scale empirical study of the scaling properties of multilingual neural machine translation models. We examine how increases in model size affect model performance and investigate the role of the training mixture composition in the scaling behavior. We find that changing the weightings of the individual language pairs in the training mixture affects only the multiplicative factor of the scaling law. In particular, we observe that multilingual models trained using different mixing rates all exhibit the same scaling exponent. Through a novel joint scaling law formulation, we compute the effective number of parameters allocated to each language pair and examine the role of language similarity in the scaling behavior of our models. We find little evidence that language similarity has any impact. In contrast, the direction of the multilinguality plays a significant role, with models translating from multiple languages into English having a larger number of effective parameters per task than their reversed counterparts. Finally, we leverage our observations to predict the performance of multilingual models trained with any language weighting at any scale, significantly reducing the effort required for language balancing in large multilingual models. Our findings apply to both in-domain and out-of-domain test sets and to multiple evaluation metrics, such as ChrF and BLEURT.
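As a minimal illustration of the claim that the mixture weights change only the multiplicative factor, the sketch below assumes a power-law form L(N) = beta * N^(-alpha) + L_inf for the loss on one language pair. This form, the helper names `scaling_loss` and `effective_fraction`, and all numeric values are assumptions made for illustration; none of them is taken from the abstract. Under this assumed form, a change in beta at a fixed exponent alpha is equivalent to rescaling the parameter count N by a fraction f, which is one way to read "effective number of parameters".

```python
def scaling_loss(n_params: float, beta: float, alpha: float, l_inf: float) -> float:
    """Assumed power law: L(N) = beta * N**(-alpha) + l_inf."""
    return beta * n_params ** (-alpha) + l_inf


def effective_fraction(beta_mix: float, beta_ref: float, alpha: float) -> float:
    """Fraction f such that beta_mix * N**(-alpha) == beta_ref * (f * N)**(-alpha).

    Solving for f gives f = (beta_mix / beta_ref) ** (-1 / alpha).
    """
    return (beta_mix / beta_ref) ** (-1.0 / alpha)


# Hypothetical values, chosen only to exercise the functions.
ALPHA, L_INF, BETA_REF = 0.3, 1.0, 10.0

# A mixture whose multiplicative factor corresponds to 40% of the parameters.
beta_mix = BETA_REF * 0.4 ** (-ALPHA)
f = effective_fraction(beta_mix, BETA_REF, ALPHA)

# The mixture model at size n matches the reference model at size f * n.
n = 1e9
gap = scaling_loss(n, beta_mix, ALPHA, L_INF) - scaling_loss(f * n, BETA_REF, ALPHA, L_INF)
```

Because the exponent is shared, the two loss curves differ only by a horizontal shift on a log-scale parameter axis, so under these assumptions a fit obtained for one weighting could be reused for another by estimating f alone.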