We present Belebele, a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Significantly expanding the language coverage of natural language understanding (NLU) benchmarks, this dataset enables the evaluation of text models in high-, medium-, and low-resource languages. Each question is based on a short passage from the Flores-200 dataset and has four multiple-choice answers. The questions were carefully curated to discriminate between models with different levels of general language comprehension. The English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. We use this dataset to evaluate the capabilities of multilingual masked language models (MLMs) and large language models (LLMs). We present extensive results and find that despite significant cross-lingual transfer in English-centric LLMs, much smaller MLMs pretrained on balanced multilingual data still understand far more languages. We also observe that larger vocabulary size and conscious vocabulary construction correlate with better performance on low-resource languages. Overall, Belebele opens up new avenues for evaluating and analyzing the multilingual capabilities of NLP systems.
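The abstract describes each Belebele item as a short Flores-200 passage paired with a question and four candidate answers, fully parallel across languages, which makes per-language accuracy directly comparable. The following is a minimal sketch of how such items could be scored with a user-supplied predictor; the Hugging Face dataset id facebook/belebele, the per-language config names, the "test" split, and the field names (flores_passage, question, mc_answer1-4, correct_answer_num) are assumptions about the released format rather than details stated above.

```python
# Minimal evaluation sketch for Belebele-style items.
# Assumptions (not stated in the abstract): dataset id "facebook/belebele",
# one config per language (e.g. "eng_Latn"), a "test" split, and fields
# flores_passage, question, mc_answer1..mc_answer4, correct_answer_num (1-indexed).
from datasets import load_dataset

def evaluate(predict, language="eng_Latn"):
    """predict(passage, question, choices) -> index (0-3) of the chosen answer."""
    data = load_dataset("facebook/belebele", language, split="test")
    correct = 0
    for row in data:
        choices = [row[f"mc_answer{i}"] for i in range(1, 5)]
        pred = predict(row["flores_passage"], row["question"], choices)
        if pred == int(row["correct_answer_num"]) - 1:
            correct += 1
    return correct / len(data)

if __name__ == "__main__":
    # Trivial baseline: always pick the first answer (~25% expected accuracy
    # on a four-way multiple-choice task).
    print(evaluate(lambda passage, question, choices: 0))
```

Because the dataset is fully parallel, the same loop run over every language config yields accuracies that can be compared directly across the 122 variants.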