The purpose of this work is to share an English-Yor\`ub\'a evaluation dataset for open-book reading comprehension and text generation, assessing model performance in both a high- and a low-resource language. The dataset contains 358 questions and answers over 338 English documents and 208 Yor\`ub\'a documents. The average document length is ~10k words for English and 430 words for Yor\`ub\'a. Experiments show a consistent disparity in performance between the two languages, with Yor\`ub\'a falling behind English on automatic metrics even though its documents are much shorter. On a small set of documents of comparable length, performance on Yor\`ub\'a drops by a factor of 2.5. When analyzing performance by length, we observe that Yor\`ub\'a performance degrades dramatically once documents reach 1500 words, while English performance is barely affected at that length. Our dataset opens the door to assessing whether the English reading-comprehension capabilities of LLMs extend to Yor\`ub\'a, which for the evaluated LLMs is not the case.