Arabic remains one of the most underrepresented languages in natural language processing research, particularly in medical applications, owing to the limited availability of open-source data and benchmarks. This scarcity of resources hinders efforts to evaluate and advance the multilingual capabilities of Large Language Models (LLMs). In this paper, we introduce MedAraBench, a large-scale dataset of Arabic multiple-choice question-answer pairs spanning a range of medical specialties. We constructed the dataset by manually digitizing a large repository of academic materials created by medical professionals in the Arabic-speaking region. We then conducted extensive preprocessing and split the dataset into training and test sets to support future research in this area. To assess data quality, we adopted two complementary frameworks: expert human evaluation and LLM-as-a-judge. Our dataset is diverse and of high quality, covering 19 specialties and five difficulty levels. For benchmarking, we evaluated eight state-of-the-art open-source and proprietary models, including GPT-5, Gemini 2.0 Flash, and Claude 4-Sonnet. Our findings highlight the need for further domain-specific enhancements. We release the dataset and evaluation scripts to broaden the diversity of medical data benchmarks, expand the scope of evaluation suites for LLMs, and strengthen the multilingual capabilities of models intended for deployment in clinical settings.