Large language models (LLMs) in medicine are evaluated mainly through multiple-choice question answering (MCQA), which can overestimate real clinical ability because of guessing strategies and answer biases. To address these limitations, we introduce an expanded and more challenging benchmark based on Polish medical exams, adding over 15,000 questions, two new domains, and four structural modifications that reduce MCQA-specific artifacts and better test reasoning. We evaluate 21 LLMs and show that evaluation design strongly affects results: under our harder setup, the score of the best model (Qwen3.5-122B) drops by 28.4 and 31 percentage points (pp) on the English and Polish exams, respectively. Although we find little evidence of data contamination, standard MCQA scores do not reliably reflect true medical competence. To facilitate further research, we make our benchmark publicly available.
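The claim that guessing inflates MCQA scores can be illustrated with the classical correction for guessing, which rescales raw accuracy so that uniform random guessing maps to 0 and a perfect score maps to 1. The sketch below is not one of the benchmark's four structural modifications, which the abstract does not specify. The function name `chance_corrected_accuracy` and the five-option question format are assumptions made purely for illustration.

```python
def chance_corrected_accuracy(raw_accuracy: float, num_options: int) -> float:
    """Rescale raw MCQA accuracy so that uniform random guessing scores 0.

    Hypothetical helper for illustration only; it is not part of the benchmark.
    """
    if num_options < 2:
        raise ValueError("a multiple-choice question needs at least 2 options")
    if not 0.0 <= raw_accuracy <= 1.0:
        raise ValueError("raw_accuracy must lie in [0, 1]")
    chance = 1.0 / num_options  # expected accuracy of uniform random guessing
    return (raw_accuracy - chance) / (1.0 - chance)


if __name__ == "__main__":
    # Assumed example: 80% raw accuracy on five-option questions.
    print(chance_corrected_accuracy(0.80, 5))
```

With five options, a model that guesses uniformly at random already reaches 20% raw accuracy, so part of any raw MCQA score reflects the answer format. This correction removes only that uniform-guessing floor. It does not account for answer-position biases or elimination strategies.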