Pretrained language models (PLMs) display impressive performance and have captured the attention of the NLP community. Establishing best practices in pretraining has therefore become a major focus of NLP research, especially since insights gained from monolingual English models may not necessarily apply to more complex multilingual models. One significant caveat of the current state of the art is that different works are rarely comparable: they often differ in parameter counts, training data, and evaluation methodology. This paper proposes a comparison of multilingual pretraining objectives in a controlled methodological environment. We ensure that training data and model architectures are comparable, and evaluate downstream performance across six languages in both probing and fine-tuning scenarios. We make two key observations: (1) the architecture dictates which pretraining objective is optimal; (2) multilingual translation is a very effective pretraining objective under the right conditions. We make our code, data, and model weights available at \texttt{\url{https://github.com/Helsinki-NLP/lm-vs-mt}}.