While multilingual language models (MLMs) have been trained on 100+ languages, they are typically evaluated on only a handful of them due to a lack of available test data in most languages. This is particularly problematic when assessing MLMs' potential for low-resource and unseen languages. In this paper, we present an analysis of existing evaluation frameworks in multilingual NLP, discuss their limitations, and propose several directions for more robust and reliable evaluation practices. Furthermore, we empirically study to what extent machine translation offers a reliable alternative to human translation for large-scale evaluation of MLMs across a wide set of languages. We use a state-of-the-art translation model to translate test data from 4 tasks into 198 languages and use the resulting test sets to evaluate three MLMs. We show that while the selected subsets of high-resource test languages are generally sufficiently representative of a wider range of high-resource languages, we tend to overestimate MLMs' abilities on low-resource languages. Finally, we show that simpler baselines can achieve relatively strong performance without having benefited from large-scale multilingual pretraining.
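To make the translation step concrete, the sketch below shows one way to machine-translate English test items into another language with an off-the-shelf many-to-many model via Hugging Face transformers. This is an illustration, not the authors' pipeline: the abstract only states that a state-of-the-art translation model was used, and the specific checkpoint (facebook/nllb-200-distilled-600M) and language codes here are assumptions.

```python
# Minimal sketch: machine-translating English test items into a target language.
# Assumptions for illustration only: the NLLB-200 distilled checkpoint and the
# FLORES-style language codes (eng_Latn, zul_Latn) are not specified in the paper.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"  # assumed translation model
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate(texts, tgt_lang="zul_Latn"):
    """Translate a batch of English test items into the target language."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(
        **inputs,
        # Force decoding to start with the target-language tag.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=256,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate(["The weather is nice today."]))
```

In a setup like this, the same routine would be applied to every test instance of each task, producing parallel machine-translated evaluation sets for each target language.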