EyeMVP: OCT-Informed Fundus Representation Learning via Paired CFP-OCT Pretraining

Zhuo Deng, Ruiheng Zhang, Ziheng Zhang, Weihao Gao, Yitong Li, Qian Wang, Lei Shao, Jiaoyue Dong, Zhixi Zeng, Lijian Fang, Haibo Wang, Xiaobin Lin, Tao Liu, Zhicheng Du, Zhengwei Zhang, Lin Yang, Zheng Gong, Xinyu Zhao, Zhenquan Wu, Fang Li, Zhiguang Zhou, Guoming Zhang, Sun Jing, Han Lv, Wenbin We, Lan Ma

Color fundus photography (CFP) is the mainstay for large-scale retinal screening, yet its diagnostic capacity is constrained by the lack of depth-resolved structural information. Optical coherence tomography (OCT) provides cross-sectional retinal anatomy, but is less accessible in population-level screening. Here, we present EyeMVP, a cross-modal retinal foundation model that uses paired CFP-OCT pretraining to learn OCT-informed CFP representations. EyeMVP is pretrained on 674,893 strict same-eye same-day paired CFP-OCT image triples from 112,642 patients across eight hospitals in China. The model uses cross-modal masked reconstruction to enrich CFP representations with OCT-associated supervision, while requiring only CFP images at inference. To accommodate the non-aligned imaging geometry between en-face CFP and cross-sectional OCT, EyeMVP combines source-constrained cross-attention with CFP-derived structural masks. Across 16 downstream tasks, including classification, segmentation, few-shot adaptation, and cross-modal retrieval, EyeMVP outperforms representative retinal foundation models and shows consistent gains on tasks involving macular and optic nerve structure. For CFP-challenging macular diseases, EyeMVP achieves an AUROC of 0.948 for macular edema (vs. 0.852 for EyeCLIP) and 0.825 for myopic macular schisis. In an exploratory reader study, EyeMVP exceeds junior and intermediate ophthalmologist groups but does not reach senior ophthalmologist performance on macular edema, while showing numerically higher balanced accuracy than all reader groups on myopic macular schisis. These results suggest that pixel-level cross-modal reconstruction can enrich CFP representations with OCT-associated supervision, providing a practical route toward stronger CFP-based retinal analysis in screening settings.
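
For readers less familiar with the two metrics quoted above, the following is a minimal sketch of how AUROC and balanced accuracy are conventionally computed for a binary task. It is not the authors' evaluation code: the function names `auroc` and `balanced_accuracy`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and none of the numbers come from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = disease)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative case")
    return 0.5 * (tp / pos + tn / neg)


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not from the paper).
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]

# Hypothetical operating point: call a case positive when its score is >= 0.5.
preds = [1 if s >= 0.5 else 0 for s in scores]

print(f"balanced accuracy: {balanced_accuracy(labels, preds):.3f}")
print(f"AUROC: {auroc(labels, scores):.3f}")
```

AUROC is threshold-free and summarizes how well the scores rank cases, whereas balanced accuracy depends on a chosen operating point and weights sensitivity and specificity equally. That difference is one reason a comparison against human readers, who give binary decisions, is typically reported in terms of balanced accuracy rather than AUROC.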