The focus of language model evaluation has shifted toward reasoning and knowledge-intensive tasks, driven by advances in pretraining large models. While state-of-the-art models are partially trained on large Arabic corpora, evaluating their performance in Arabic remains challenging due to the limited availability of relevant datasets. To bridge this gap, we present \datasetname{}, the first multi-task language understanding benchmark for the Arabic language, sourced from school exams across diverse educational levels in different countries spanning North Africa, the Levant, and the Gulf regions. Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed in collaboration with native speakers in the region. Our comprehensive evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models. Notably, BLOOMZ, mT0, LLaMA2, and Falcon struggle to reach a score of 50%, while even the top-performing Arabic-centric model achieves only 62.3%.