BACKGROUND: Medical large language models (LLMs) have demonstrated remarkable performance in answering medical examination questions. However, the extent to which this high performance transfers to medical questions in Spanish and from a Latin American country remains unexplored. This knowledge is crucial as LLM-based medical applications gain traction in Latin America. AIMS: To build a dataset of questions from medical examinations taken by Peruvian physicians pursuing specialty training; to fine-tune an LLM on this dataset; and to evaluate and compare the accuracy of vanilla LLMs against the fine-tuned LLM. METHODS: We curated PeruMedQA, a multiple-choice question-answering (MCQA) dataset containing 8,380 questions spanning 12 specialties (2018-2025). We selected ten medical LLMs, including medgemma-4b-it and medgemma-27b-text-it, and developed zero-shot, task-specific prompts to answer the questions. We employed parameter-efficient fine-tuning (PEFT) with low-rank adaptation (LoRA) to fine-tune medgemma-4b-it on all questions except those from 2025 (test set). RESULTS: Medgemma-27b showed the highest accuracy across nearly all specialties, achieving its highest score of 89.29% in Psychiatry; however, OctoMed-7B was slightly superior in two specialties: Neurosurgery (77.38% vs. 77.27%) and Radiology (77.39% vs. 76.13%). Across specialties, most LLMs with <10 billion parameters answered fewer than 50% of questions correctly. The fine-tuned version of medgemma-4b-it outperformed all LLMs with <10 billion parameters and rivaled an LLM with 70 billion parameters across various examinations. CONCLUSIONS: For medical AI applications and research requiring knowledge bases from Spanish-speaking countries, or from countries with epidemiological profiles similar to Peru's, interested parties should utilize medgemma-27b-text-it.
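The low-rank adaptation step mentioned in METHODS can be illustrated with a minimal numerical sketch. This is not the paper's actual setup (which fine-tunes medgemma-4b-it via the PEFT library); the dimensions, rank `r`, and scaling `alpha` below are hypothetical and chosen only to show the core LoRA idea: the pretrained weight `W` stays frozen while two small matrices `A` and `B` are trained, adding a low-rank correction `(alpha / r) * B @ A` to the layer.

```python
import numpy as np

# Hypothetical dimensions and LoRA hyperparameters (illustrative only;
# not the configuration used for medgemma-4b-it in the paper).
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 16, 16, 4, 8

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero-init

def lora_forward(x, W, A, B, r, alpha):
    """Adapted layer output: W x plus the scaled low-rank update B A x."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapted layer exactly matches the
# frozen base layer, so fine-tuning starts from the pretrained behavior.
assert np.allclose(lora_forward(x, W, A, B, r, alpha), W @ x)
```

In practice only `A` and `B` (a few million parameters) receive gradient updates, which is what makes PEFT feasible on a single GPU for a 4B-parameter model.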