OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models

High-stakes clinical use of large vision-language models (LVLMs) requires reasoning that is grounded in visual evidence and clinical knowledge, not just correct final answers. We introduce OpenMedReason, a large-scale, open multimodal medical reasoning corpus of approximately 450K image-question-answer instances whose reasoning traces are derived primarily from curated, human-authored biomedical scientific articles. OpenMedReason thus provides high-fidelity supervision beyond synthetic chains of thought and covers diverse medical visual modalities, including radiological scans, microscopic images, visible-light photographs, and charts. We complement it with OpenMedReason-Bench, a held-out benchmark that supports fine-grained evaluation of LVLMs along three complementary capability axes (perception, medical knowledge, and rationale), enabling diagnostic evaluation beyond final-answer accuracy. OpenMedReason is effective as a training resource for both supervised fine-tuning (SFT) and reinforcement-based alignment. Training with OpenMedReason yields a 20% average improvement in VQA accuracy over the base model and brings performance within 4.2% of the strongest comparable-scale medical LVLMs. Fine-grained analysis confirms that the gains are not concentrated in any single axis: OpenMedReason improves perception, medical knowledge, and rationale jointly, and the resulting model's reasoning traces are preferred over those of the base model in 86.1% of pairwise comparisons. We release the code and dataset at huggingface.co/datasets/neginb/OpenMedReason.
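As an illustration of the two kinds of summary statistic mentioned above (per-axis scores and a pairwise preference rate), the following minimal sketch shows how such numbers could be computed from evaluation records. The record fields (`axis`, `correct`), the axis labels, and the winner labels (`"finetuned"`, `"base"`) are hypothetical and are not taken from the released benchmark schema; the abstract does not specify the actual scoring protocol.

```python
from collections import defaultdict

# Hypothetical axis labels mirroring the three capability axes named in the abstract.
AXES = ("perception", "medical_knowledge", "rationale")


def per_axis_accuracy(records):
    """Return accuracy per axis.

    `records` is an iterable of dicts with the assumed keys
    'axis' (one of AXES) and 'correct' (bool).
    Axes with no records are omitted from the result.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record["axis"]] += 1
        hits[record["axis"]] += int(record["correct"])
    return {axis: hits[axis] / totals[axis] for axis in AXES if totals[axis]}


def preference_rate(winners, candidate="finetuned"):
    """Return the fraction of pairwise comparisons won by `candidate`.

    `winners` is a list of winner labels, one per comparison (assumed format).
    """
    if not winners:
        raise ValueError("no comparisons provided")
    return sum(1 for w in winners if w == candidate) / len(winners)


# Toy data for illustration only; these are not results from the paper.
toy_records = [
    {"axis": "perception", "correct": True},
    {"axis": "perception", "correct": False},
    {"axis": "medical_knowledge", "correct": True},
    {"axis": "rationale", "correct": True},
    {"axis": "rationale", "correct": True},
]
toy_winners = ["finetuned", "finetuned", "base", "finetuned"]

axis_scores = per_axis_accuracy(toy_records)
win_rate = preference_rate(toy_winners)
```

Reporting the per-axis scores side by side, rather than a single aggregate, is what makes it possible to check whether a gain is concentrated in one axis.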
