Multimodal large language models (MLLMs) have advanced rapidly, yet their adoption in medicine remains limited by gaps in domain coverage, modality alignment, and grounded reasoning. In this work, we introduce MedMO, a medical foundation model built on a generalized MLLM architecture and trained exclusively on large-scale, domain-specific data. MedMO follows a multi-stage training recipe: (i) cross-modal pretraining to align heterogeneous visual encoders with a medical language backbone; (ii) instruction tuning on multi-task supervision spanning captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes; and (iii) reinforcement learning with verifiable rewards that combine factuality checks with a box-level GIoU reward to strengthen spatial grounding and step-by-step reasoning in complex clinical scenarios. MedMO consistently outperforms strong open-source medical MLLMs across multiple modalities and tasks. On VQA benchmarks, MedMO achieves an average accuracy improvement of +13.7% over the baseline and performs within 1.9% of the state-of-the-art Fleming-VL. On text-based QA, it attains +6.9% over the baseline and +14.5% over Fleming-VL. In medical report generation, MedMO delivers significant gains in both semantic and clinical accuracy. Moreover, it exhibits strong grounding capability, achieving an IoU improvement of +40.4% over the baseline and +37.0% over Fleming-VL, underscoring its robust spatial reasoning and localization performance. Evaluations across radiology, ophthalmology, and pathology microscopy confirm MedMO's broad cross-modality generalization. We release two versions of MedMO: 4B and 8B. The project page is available at https://genmilab.github.io/MedMO-Page
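The box-level GIoU reward used in stage (iii) can be illustrated with a minimal sketch. This is an assumed formulation based on the standard generalized IoU definition; the paper's exact reward shaping, box format, and any scaling are not specified here. Boxes are assumed to be `(x1, y1, x2, y2)` corner coordinates.

```python
# Minimal sketch of a box-level GIoU reward (standard generalized IoU;
# the paper's exact reward formulation is assumed, not confirmed).
# Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.

def giou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero if the boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Union area
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0

    # Smallest enclosing box covering both
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    c_area = (cx2 - cx1) * (cy2 - cy1)

    # GIoU = IoU - (C - U) / C, in [-1, 1]; unlike plain IoU,
    # it gives a nonzero gradient signal even for disjoint boxes.
    return iou - (c_area - union) / c_area if c_area > 0 else iou
```

Unlike plain IoU, which is zero for any non-overlapping prediction, GIoU penalizes a disjoint box in proportion to how far it is from the target, which is why it is a common choice for rewarding spatial grounding.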