Pretrained language models (PLMs) display impressive performance and have captured the attention of the NLP community. Establishing best practices in pretraining has therefore become a major focus of NLP research -- especially since insights developed for monolingual English models need not carry over to more complex multilingual settings. One significant caveat of the current state of the art is that different works are rarely comparable: they often differ in parameter counts, training data, and evaluation methodology. This paper proposes a comparison of multilingual pretraining objectives in a controlled methodological environment. We ensure that training data and model architectures are comparable, and discuss the downstream performance across 6 languages that we observe in probing and fine-tuning scenarios. We make two key observations: (1) the architecture dictates which pretraining objective is optimal; (2) multilingual translation is a very effective pretraining objective under the right conditions. We make our code, data, and model weights available at \texttt{\url{https://github.com/Helsinki-NLP/lm-vs-mt}}.