We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which covers 306 languages and comprises 1,220,757 samples in total. We start from Wikipedia articles, which also serve as the context for the dataset samples, and use an LLM to generate question/answer pairs related to each article, ensuring that the answer appears verbatim within the article. The question is then rephrased to prevent simple word-matching methods from performing well on the dataset. We conduct a crowdsourced human evaluation of the fluency of the generated questions, with 156 respondents across 30 of the languages (both low- and high-resource). All 30 languages received a mean fluency rating above ``mostly natural'', indicating that the samples are of good quality. We evaluate 6 different language models, both decoder and encoder models of varying sizes, showing that the benchmark is sufficiently difficult and that there is a large performance discrepancy across the languages. Both the dataset and the survey evaluations are publicly available.