Fundus imaging, such as color fundus photography (CFP), optical coherence tomography (OCT) and ultra-widefield (UWF) imaging, is crucial for the early detection of retinal anomalies and diseases. Because it is knowledge-intensive, fundus image understanding is a challenging vision-language task. An emerging approach to this task is to post-train a generic multimodal large language model (MLLM), either by supervised finetuning (SFT) or by reinforcement learning with verifiable rewards (RLVR), on a considerable number of in-house samples paired with high-quality clinical reports. However, these valuable samples are not publicly accessible, which not only hinders reproducibility but also, in practice, limits research to a few players. To overcome this barrier, we make a novel attempt to train a reasoning-enhanced fundus-reading MLLM, which we term Fundus-R1, using exclusively public datasets, wherein over 94% of the data are annotated with only image-level labels. Our technical contributions are twofold. First, we propose a method based on retrieval-augmented generation (RAG) for composing image-specific, knowledge-aware reasoning traces. These auto-generated traces use ophthalmic knowledge to link the visual findings identified by a generic MLLM to the image labels. Second, we enhance RLVR with a process reward that encourages self-consistency of the reasoning trace generated in each rollout. Extensive experiments on three fundus-reading benchmarks, i.e., FunBench, Omni-Fundus and GMAI-Fundus, show that Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger variant post-trained without the generated traces. This work paves the way for training powerful fundus-reading MLLMs with publicly available data.
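To make the second contribution more concrete, the following is a minimal sketch of how an outcome reward could be combined with a process reward that checks the self-consistency of a rollout. The abstract does not specify how the process reward is computed, so everything here is an assumption made for illustration: the `<think>`/`<answer>` output format, the function names, the rule that a trace counts as consistent when it mentions the final answer, and the `process_weight` value are all hypothetical and are not taken from Fundus-R1.

```python
import re

# Hypothetical rollout format: "<think>reasoning</think><answer>label</answer>".
# This format is an assumption for illustration, not the paper's specification.
ROLLOUT_PATTERN = re.compile(
    r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL
)


def parse_rollout(text):
    """Split a rollout into (trace, answer); return None if it is malformed."""
    match = ROLLOUT_PATTERN.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def outcome_reward(answer, label):
    """Verifiable reward: 1.0 if the answer matches the image-level label."""
    return 1.0 if answer.lower() == label.lower() else 0.0


def process_reward(trace, answer):
    """Assumed consistency check: 1.0 if the trace mentions the final answer."""
    if not answer:
        return 0.0
    return 1.0 if answer.lower() in trace.lower() else 0.0


def total_reward(rollout, label, process_weight=0.5):
    """Outcome reward plus a weighted process reward; 0.0 for malformed output."""
    parsed = parse_rollout(rollout)
    if parsed is None:
        return 0.0
    trace, answer = parsed
    return outcome_reward(answer, label) + process_weight * process_reward(trace, answer)


consistent = (
    "<think>Microaneurysms and hard exudates suggest diabetic retinopathy.</think>"
    "<answer>Diabetic retinopathy</answer>"
)
inconsistent = (
    "<think>An enlarged cup-to-disc ratio suggests glaucoma.</think>"
    "<answer>Diabetic retinopathy</answer>"
)
print(total_reward(consistent, "diabetic retinopathy"))
print(total_reward(inconsistent, "diabetic retinopathy"))
```

Under these assumptions, two rollouts that reach the same correct answer are scored differently when only one of them reasons toward that answer, which is the kind of signal an outcome-only reward cannot provide. A substring match is a crude stand-in for a consistency check and would likely be replaced by something stronger in practice.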