Large Language Models (LLMs) have the potential to facilitate the development of Artificial Intelligence technology that assists medical experts in interactive decision support, as demonstrated by their competitive performance on Medical QA tasks. However, while impressive, LLMs still fall far short of the quality bar required for medical applications. Currently, LLMs remain challenged by outdated knowledge and by their tendency to generate hallucinated content. Furthermore, most benchmarks for assessing medical knowledge lack reference gold explanations, which makes it impossible to evaluate the reasoning behind LLMs' predictions. Finally, the situation is particularly grim for languages other than English, where benchmarking LLMs remains, as far as we know, a completely neglected topic. To address these shortcomings, in this paper we present MedExpQA, the first multilingual benchmark based on medical exams for evaluating LLMs in Medical Question Answering. To the best of our knowledge, MedExpQA is the first benchmark to include reference gold explanations written by medical doctors, which can be leveraged to establish various gold-based upper bounds for comparison with LLM performance. Comprehensive multilingual experimentation using both the gold reference explanations and Retrieval Augmented Generation (RAG) approaches shows that the performance of LLMs still leaves large room for improvement, especially for languages other than English. Furthermore, despite using state-of-the-art RAG methods, our results also demonstrate the difficulty of obtaining and integrating readily available medical knowledge that positively impacts downstream evaluation results for Medical Question Answering. So far the benchmark is available in four languages, but we hope that this work encourages further extension to other languages.