We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes a baseline vision-language backbone with group-based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward that judges semantic correctness with a strict YES/NO decision, a medical embedding-based semantic reward that captures paraphrases and terminology variants, and lightweight format and modality rewards that enforce interpretable reasoning and modality recognition. This multi-signal design provides stable, informative feedback for open-ended outputs where traditional verifiable or MCQ-only rewards fall short. To measure progress, we propose a unified evaluation framework for both text-only and image+text tasks that uses a reference-based LLM-as-judge in place of brittle string-overlap metrics, capturing semantic correctness, reasoning, and contextual alignment. Despite using only $\sim51$K instruction examples, MediX-R1 achieves strong results across standard medical LLM (text-only) and VLM (image+text) benchmarks, outperforming strong open-source baselines and delivering particularly large gains on open-ended clinical tasks. Our results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models. Our trained models, curated datasets, and source code are available at https://medix.cvmbzuai.com