In this paper, we address the significant gap in Arabic natural language processing (NLP) resources by introducing ArabicaQA, the first large-scale dataset for machine reading comprehension and open-domain question answering in Arabic. The dataset consists of 89,095 answerable questions and 3,701 unanswerable questions created by crowdworkers to look similar to answerable ones, along with additional labels for open-domain questions. We also present AraDPR, the first dense passage retrieval model trained on the Arabic Wikipedia corpus, designed specifically to tackle the challenges of Arabic text retrieval. Furthermore, we extensively benchmark large language models (LLMs) on Arabic question answering and critically evaluate their performance in the Arabic language context. Together, ArabicaQA, AraDPR, and the LLM benchmarks advance the field of Arabic NLP. The dataset and code are publicly available for further research at https://github.com/DataScienceUIBK/ArabicaQA.