Automating medical report generation for retinal images requires combining visual pattern recognition with deep clinical knowledge. Current Large Vision-Language Models (LVLMs) often struggle in specialized medical fields where data is scarce: they tend to overfit and miss subtle but critical pathologies. To address this, we introduce DREAM (Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion), a framework for high-fidelity medical report generation that remains effective with limited data. DREAM uses a two-stage fusion mechanism that integrates visual data with clinical keywords curated by ophthalmologists. First, the Abstractor module maps image and keyword features into a shared space, enriching the visual features with pathology-relevant information. Next, the Adaptor performs adaptive multi-modal fusion, weighting each modality with learnable parameters to produce a unified representation. To keep the model's outputs semantically grounded in clinical reality, a Contrastive Alignment module aligns the fused representations with ground-truth medical reports during training. By combining medical expertise with an efficient fusion strategy, DREAM sets a new state of the art on the DeepEyeNet benchmark, achieving a BLEU-4 score of 0.241, and further demonstrates strong generalization to the ROCO dataset.
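The Adaptor's learnable modality weighting and the Contrastive Alignment objective can be illustrated with a minimal sketch. The abstract does not specify the exact formulation, so everything here is an assumption made for illustration: the softmax gate over two scalar logits, the InfoNCE-style loss with cosine similarity, the temperature value, and all function names (`adaptive_fuse`, `contrastive_loss`) are hypothetical and not taken from DREAM itself.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(image_feat, keyword_feat, gate_logits):
    """Fuse two same-length feature vectors with learnable modality weights.

    gate_logits holds two scalars (assumed learnable parameters); a softmax
    turns them into weights that sum to 1. This gating form is an assumption.
    """
    if len(image_feat) != len(keyword_feat):
        raise ValueError("features must live in the same shared space")
    w_img, w_kw = softmax(gate_logits)
    return [w_img * a + w_kw * b for a, b in zip(image_feat, keyword_feat)]


def cosine(u, v):
    """Cosine similarity, with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def contrastive_loss(fused_batch, report_batch, temperature=0.07):
    """InfoNCE-style loss: fused_batch[i] should match report_batch[i].

    Each fused representation is scored against every report embedding in
    the batch; the loss is the mean cross-entropy of picking the paired one.
    """
    losses = []
    for i, fused in enumerate(fused_batch):
        logits = [cosine(fused, rep) / temperature for rep in report_batch]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

With equal gate logits the two modalities contribute equally; during training the logits would shift the balance toward whichever modality is more informative. The loss is small when each fused vector is closest to its own report embedding and large when the pairing is wrong.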