Current fundus image analysis models are predominantly built for specific tasks and rely on individual datasets. Their learning process typically follows a data-driven paradigm without prior knowledge, resulting in poor transferability and generalizability. To address this issue, we propose MM-Retinal, a multi-modal dataset comprising high-quality image-text pairs collected from professional fundus diagram books. Moreover, enabled by MM-Retinal, we present a novel Knowledge-enhanced foundation pretraining model that incorporates Fundus Image-Text expertise, called KeepFIT. It is designed with an image similarity-guided text revision strategy and a mixed training strategy to infuse expert knowledge. Our proposed fundus foundation model achieves state-of-the-art performance across six unseen downstream tasks and exhibits excellent generalization in zero-shot and few-shot scenarios. MM-Retinal and KeepFIT are available at https://github.com/lxirich/MM-Retinal.