Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data

Fundus imaging, such as color fundus photography (CFP), optical coherence tomography (OCT) and ultra-widefield (UWF) imaging, is crucial for the early detection of retinal anomalies and diseases. Because fundus image understanding is knowledge-intensive, it poses a challenging vision-language task. An emerging approach to this task is to post-train a generic multimodal large language model (MLLM), either by supervised finetuning (SFT) or by reinforcement learning with verifiable rewards (RLVR), on a considerable number of in-house samples paired with high-quality clinical reports. However, these valuable samples are not publicly accessible, which not only hinders reproducibility but also in practice limits research to a few players. To overcome this barrier, we make a novel attempt to train a reasoning-enhanced fundus-reading MLLM, which we term Fundus-R1, using exclusively public datasets, in which over 94% of the data are annotated with only image-level labels. Our technical contributions are two-fold. First, we propose a RAG-based method for composing image-specific, knowledge-aware reasoning traces. These auto-generated traces use ophthalmic knowledge to link the visual findings identified by a generic MLLM to the image labels. Second, we enhance RLVR with a process reward that encourages self-consistency of the generated reasoning trace in each rollout. Extensive experiments on three fundus-reading benchmarks, i.e., FunBench, Omni-Fundus and GMAI-Fundus, show that Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger variant post-trained without the generated traces. This work paves the way for training powerful fundus-reading MLLMs with publicly available data.
