We present Belebele, a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Significantly expanding the language coverage of natural language understanding (NLU) benchmarks, this dataset enables the evaluation of text models in high-, medium-, and low-resource languages. Each question is based on a short passage from the Flores-200 dataset and has four multiple-choice answers. The questions were carefully curated to discriminate between models with different levels of general language comprehension. The English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. We use this dataset to evaluate the capabilities of multilingual masked language models (MLMs) and large language models (LLMs). We present extensive results and find that despite significant cross-lingual transfer in English-centric LLMs, much smaller MLMs pretrained on balanced multilingual data still understand far more languages. We also observe that larger vocabulary size and conscious vocabulary construction correlate with better performance on low-resource languages. Overall, Belebele opens up new avenues for evaluating and analyzing the multilingual capabilities of NLP systems.
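The abstract describes each Belebele item as a short Flores-200 passage paired with a question and four candidate answers, fully parallel across languages, which makes per-language accuracy directly comparable. The following is a minimal sketch of how such items could be scored with a user-supplied predictor; the Hugging Face dataset id facebook/belebele, the per-language config names, the "test" split, and the field names (flores_passage, question, mc_answer1-4, correct_answer_num) are assumptions about the released format rather than details stated above.

```python
# Minimal evaluation sketch for Belebele-style items.
# Assumptions (not stated in the abstract): dataset id "facebook/belebele",
# one config per language (e.g. "eng_Latn"), a "test" split, and fields
# flores_passage, question, mc_answer1..mc_answer4, correct_answer_num (1-indexed).
from datasets import load_dataset

def evaluate(predict, language="eng_Latn"):
    """predict(passage, question, choices) -> index (0-3) of the chosen answer."""
    data = load_dataset("facebook/belebele", language, split="test")
    correct = 0
    for row in data:
        choices = [row[f"mc_answer{i}"] for i in range(1, 5)]
        pred = predict(row["flores_passage"], row["question"], choices)
        if pred == int(row["correct_answer_num"]) - 1:
            correct += 1
    return correct / len(data)

if __name__ == "__main__":
    # Trivial baseline: always pick the first answer (~25% expected accuracy
    # on a four-way multiple-choice task).
    print(evaluate(lambda passage, question, choices: 0))
```

Because the dataset is fully parallel, the same loop run over every language config yields accuracies that can be compared directly across the 122 variants.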