Arabic remains one of the most underrepresented languages in natural language processing research, particularly in medical applications, owing to the limited availability of open-source data and benchmarks. This scarcity of resources hinders efforts to evaluate and advance the multilingual capabilities of Large Language Models (LLMs). In this paper, we introduce MedAraBench, a large-scale dataset of Arabic multiple-choice question-answer pairs spanning a range of medical specialties. We constructed the dataset by manually digitizing a large repository of academic materials created by medical professionals in the Arabic-speaking region. We then conducted extensive preprocessing and split the dataset into training and test sets to support future research in this area. To assess data quality, we adopted two complementary frameworks: expert human evaluation and LLM-as-a-judge. Our dataset is diverse and of high quality, covering 19 specialties and five difficulty levels. For benchmarking, we evaluated eight state-of-the-art open-source and proprietary models, including GPT-5, Gemini 2.0 Flash, and Claude 4-Sonnet. Our findings highlight the need for further domain-specific enhancements. We release the dataset and evaluation scripts to broaden the diversity of medical data benchmarks, expand the scope of evaluation suites for LLMs, and strengthen the multilingual capabilities of models intended for deployment in clinical settings.