We introduce a new reading comprehension dataset, dubbed MultiWikiQA, which covers 306 languages. The context data comes from Wikipedia articles, with questions generated by an LLM and answers appearing verbatim in the articles. We conduct a crowdsourced human evaluation of the fluency of the generated questions across 30 of the languages, providing evidence that the questions are of good quality. We evaluate six language models, both decoder and encoder models of varying sizes, showing that the benchmark is sufficiently difficult and that there is a large performance discrepancy across the languages. The dataset and survey evaluations are freely available.