While multilingual language models (MLMs) have been trained on 100+ languages, they are typically evaluated on only a handful of them due to a lack of available test data in most languages. This is particularly problematic when assessing MLMs' potential for low-resource and unseen languages. In this paper, we present an analysis of existing evaluation frameworks in multilingual NLP, discuss their limitations, and propose several directions for more robust and reliable evaluation practices. Furthermore, we empirically study to what extent machine translation offers a reliable alternative to human translation for large-scale evaluation of MLMs across a wide set of languages. We use a state-of-the-art translation model to translate test data from 4 tasks into 198 languages and use the resulting test sets to evaluate three MLMs. We show that while the selected subsets of high-resource test languages are generally sufficiently representative of a wider range of high-resource languages, we tend to overestimate MLMs' abilities on low-resource languages. Finally, we show that simpler baselines can achieve relatively strong performance without having benefited from large-scale multilingual pretraining.
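To make the translation step concrete, the sketch below shows one way to machine-translate English test items into another language with an off-the-shelf many-to-many model via Hugging Face transformers. This is an illustration, not the authors' pipeline: the abstract only states that a state-of-the-art translation model was used, and the specific checkpoint (facebook/nllb-200-distilled-600M) and language codes here are assumptions.

```python
# Minimal sketch: machine-translating English test items into a target language.
# Assumptions for illustration only: the NLLB-200 distilled checkpoint and the
# FLORES-style language codes (eng_Latn, zul_Latn) are not specified in the paper.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"  # assumed translation model
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate(texts, tgt_lang="zul_Latn"):
    """Translate a batch of English test items into the target language."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(
        **inputs,
        # Force decoding to start with the target-language tag.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=256,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate(["The weather is nice today."]))
```

In a setup like this, the same routine would be applied to every test instance of each task, producing parallel machine-translated evaluation sets for each target language.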