We introduce MedQARo, the first large-scale medical question answering (QA) benchmark in Romanian, alongside a comprehensive evaluation of state-of-the-art large language models (LLMs). We construct a high-quality, large-scale dataset comprising 105,880 QA pairs about cancer patients from two medical centers. The questions are grounded in the medical case summaries of 1,242 patients, requiring both keyword extraction and reasoning to answer. Our benchmark contains both in-domain and cross-domain (cross-center and cross-cancer) test collections, enabling a precise assessment of generalization capabilities. We experiment with four open-source LLMs from distinct model families on MedQARo, employing each model in two scenarios: zero-shot prompting and supervised fine-tuning. We also evaluate two state-of-the-art LLMs exposed only through APIs, namely GPT-5.2 and Gemini 3 Flash. Our results show that fine-tuned models significantly outperform their zero-shot counterparts, indicating that pretrained models fail to generalize to MedQARo out of the box. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian.