The "massively-multilingual" training of multilingual models is known to limit their utility in any one language, and they perform particularly poorly on low-resource languages. However, there is evidence that low-resource languages can benefit from targeted multilinguality, where the model is trained on closely related languages. To test this approach more rigorously, we systematically study best practices for adapting a pre-trained model to a language family. Focusing on the Uralic family as a test case, we adapt XLM-R under various configurations to model 15 languages; we then evaluate the performance of each experimental setting on two downstream tasks and 11 evaluation languages. Our adapted models significantly outperform mono- and multilingual baselines. Furthermore, a regression analysis of hyperparameter effects reveals that adapted vocabulary size is relatively unimportant for low-resource languages, and that low-resource languages can be aggressively up-sampled during training at little detriment to performance in high-resource languages. These results introduce new best practices for performing language adaptation in a targeted setting.
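The abstract does not say how low-resource languages are up-sampled. As an illustration only, the sketch below uses exponentiated sampling, a common scheme in multilingual pre-training, in which each language is drawn with probability proportional to its data share raised to a power `alpha`. The function name `sampling_probs`, the language labels, and the token counts are all hypothetical and are not taken from the paper. Lower values of `alpha` up-sample small languages more aggressively.

```python
def sampling_probs(token_counts, alpha):
    """Return per-language sampling probabilities q_i proportional to p_i ** alpha.

    token_counts: dict mapping language name -> number of training tokens.
    alpha: exponent in (0, 1]; alpha = 1 keeps the natural data proportions,
           smaller values flatten the distribution toward uniform.
    """
    total = sum(token_counts.values())
    # Raise each language's data share to the power alpha.
    weights = {lang: (n / total) ** alpha for lang, n in token_counts.items()}
    # Renormalize so the probabilities sum to 1.
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical token counts, for illustration only (not from the paper).
counts = {"high_resource": 1_000_000, "mid_resource": 50_000, "low_resource": 1_000}

for alpha in (1.0, 0.5, 0.1):
    probs = sampling_probs(counts, alpha)
    print(alpha, {lang: round(p, 3) for lang, p in probs.items()})
```

Under this scheme, decreasing `alpha` raises the share of training batches drawn from the smallest language and lowers the share drawn from the largest, which is the trade-off the regression analysis in the abstract is concerned with.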