Automatic evaluation of medical open-ended question answering (OEQA) remains challenging due to the need for expert annotations. We evaluate whether large language models (LLMs) can act as judges of semantic equivalence in French medical OEQA, comparing closed-access, general-purpose, and biomedical domain-adapted models. Our results show that LLM-based judgments are strongly influenced by the model that generated the answer, with agreement varying substantially across generators. Domain-adapted and large general-purpose models achieve the highest alignment with expert annotations. We further show that lightweight adaptation of a compact model using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) substantially improves performance and reduces generator sensitivity, even with limited data. Overall, our findings highlight the need for generator-aware evaluation and suggest that carefully adapted small models can support scalable evaluation in low-resource medical settings.