Question answering (QA) is an actively studied topic and a core natural language processing (NLP) task that must be addressed before Artificial General Intelligence (AGI) can be achieved. However, the lack of QA datasets in specific domains and languages hinders the development of robust AI models that can generalize across domains and languages. To this end, we introduce MedQARo, the first large-scale medical QA benchmark in Romanian, alongside a comprehensive evaluation of state-of-the-art (SOTA) large language models (LLMs). We construct a high-quality, large-scale dataset comprising 105,880 QA pairs related to cancer patients from two medical centers. The questions pertain to medical case summaries of 1,242 patients and require either keyword extraction or reasoning to be answered correctly. MedQARo is the result of a time-consuming manual annotation process carried out by seven physicians specialized in oncology or radiotherapy, who spent a total of about 3,000 work hours generating the QA pairs. Our benchmark contains both in-domain and cross-domain (cross-center and cross-cancer) test collections, enabling a precise assessment of generalization capabilities. We experiment on MedQARo with four open-source LLMs from distinct model families. Each model is employed in two scenarios, one based on zero-shot prompting and one based on supervised fine-tuning. We also evaluate two state-of-the-art LLMs exposed only through APIs, namely GPT-5.2 and Gemini 3 Flash. Our results show that fine-tuned models significantly outperform their zero-shot counterparts, clearly indicating that pretrained models fail to generalize on MedQARo. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian. We publicly release our dataset and code at https://github.com/ana-rogoz/MedQARo.