Despite the strong performance of deep learning models in retinal disease detection, most systems produce static predictions without clinical reasoning or interactive explanation. Recent advances in multimodal large language models (MLLMs) make it possible to integrate diagnostic predictions with clinically meaningful dialogue to support clinical decision-making and patient counseling. In this study, OcularChat, an MLLM, was fine-tuned from Qwen2.5-VL on simulated patient-physician dialogues to diagnose age-related macular degeneration (AMD) through visual question answering on color fundus photographs (CFPs). A total of 705,850 simulated dialogues paired with 46,167 CFPs were generated to train OcularChat to identify key AMD features and produce reasoned predictions. On AREDS, OcularChat achieved accuracies of 0.954, 0.849, and 0.678 on the three diagnostic tasks (advanced AMD, pigmentary abnormalities, and drusen size, respectively), significantly outperforming existing MLLMs. On AREDS2, OcularChat remained the top-performing method on all tasks. In an evaluation by three independent ophthalmologist graders using a 5-point clinical grading rubric, OcularChat achieved higher mean scores than a strong baseline model for advanced AMD (3.503 vs. 2.833), pigmentary abnormalities (3.272 vs. 2.828), drusen size (3.064 vs. 2.433), and overall impression (2.978 vs. 2.464). Beyond strong objective performance in AMD severity classification, OcularChat provided diagnostic reasoning, clinically relevant explanations, and interactive dialogue, and performed well in subjective ophthalmologist evaluation. These findings suggest that MLLMs may enable accurate, interpretable, and clinically useful image-based diagnosis and classification of AMD.