We propose a two-stage multimodal framework that enhances disease classification and region-aware radiology report generation from chest X-rays, leveraging the MIMIC-Eye dataset. In the first stage, we introduce a gaze-guided contrastive learning architecture for disease classification. It integrates visual features, clinical labels, bounding boxes, and radiologist eye-tracking signals, and is trained with a novel multi-term gaze-attention loss combining mean-squared error (MSE), KL divergence, correlation, and center-of-mass alignment terms. Incorporating gaze fixations improves F1 score from 0.597 to 0.631 (a 5.70% relative gain) and AUC from 0.821 to 0.849 (a 3.41% relative gain), while also improving precision and recall, highlighting the effectiveness of gaze-informed attention supervision. In the second stage, we present a modular report generation pipeline that extracts confidence-weighted diagnostic keywords, maps them to anatomical regions using a curated dictionary constructed from domain-specific priors, and generates region-aligned sentences via structured prompts. This pipeline improves report quality as measured by clinical keyword recall and ROUGE overlap. Our results demonstrate that integrating gaze data improves both classification performance and the interpretability of generated medical reports.
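To make the loss concrete, the following is a minimal PyTorch sketch of how a multi-term gaze-attention loss of this form could be composed. The term weights `w`, the tensor shapes, the per-sample normalization, and the choice of KL direction are all illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def gaze_attention_loss(attn, gaze, w=(1.0, 1.0, 0.5, 0.5)):
    """Sketch of a multi-term gaze-attention loss: MSE + KL + correlation + center-of-mass.

    attn, gaze: (B, H, W) model attention maps and radiologist fixation heatmaps,
    each assumed normalized to sum to 1 per sample. The weights `w` are
    illustrative placeholders, not the paper's values.
    """
    B, H, W = attn.shape
    a = attn.reshape(B, -1)
    g = gaze.reshape(B, -1)

    # (1) Pointwise MSE between the two spatial maps.
    l_mse = F.mse_loss(a, g)

    # (2) KL divergence KL(gaze || attn) over the spatial distributions.
    l_kl = F.kl_div((a + 1e-8).log(), g, reduction="batchmean")

    # (3) Correlation term: penalize low Pearson correlation per sample.
    a_c = a - a.mean(dim=1, keepdim=True)
    g_c = g - g.mean(dim=1, keepdim=True)
    corr = (a_c * g_c).sum(1) / (a_c.norm(dim=1) * g_c.norm(dim=1) + 1e-8)
    l_corr = (1.0 - corr).mean()

    # (4) Center-of-mass alignment: match the expected (y, x) location
    # of attention mass to that of the gaze heatmap.
    ys = torch.arange(H, dtype=attn.dtype, device=attn.device).view(1, H, 1)
    xs = torch.arange(W, dtype=attn.dtype, device=attn.device).view(1, 1, W)

    def com(m):
        return torch.stack([(m * ys).sum((1, 2)), (m * xs).sum((1, 2))], dim=1)

    l_com = F.mse_loss(com(attn), com(gaze))

    return w[0] * l_mse + w[1] * l_kl + w[2] * l_corr + w[3] * l_com
```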
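The second-stage mapping step can likewise be sketched as plain dictionary lookup plus prompt assembly. The dictionary entries, the `map_keywords`/`build_prompt` helpers, and the prompt format below are hypothetical placeholders standing in for the paper's curated resources.

```python
from collections import defaultdict

# Toy excerpt of a curated keyword -> anatomical-region dictionary
# (illustrative entries only, not the paper's actual dictionary).
REGION_DICT = {
    "cardiomegaly": "cardiac silhouette",
    "effusion": "costophrenic angles",
    "consolidation": "lung fields",
    "atelectasis": "lung bases",
}

def map_keywords(keywords):
    """Group (keyword, confidence) pairs by anatomical region."""
    regions = defaultdict(list)
    for kw, conf in keywords:
        region = REGION_DICT.get(kw.lower(), "unspecified region")
        regions[region].append((kw, conf))
    return regions

def build_prompt(regions):
    """Assemble a structured, region-aligned prompt for sentence generation."""
    lines = []
    for region, kws in regions.items():
        findings = ", ".join(f"{kw} (p={conf:.2f})" for kw, conf in kws)
        lines.append(f"Region: {region}. Findings: {findings}.")
    return "\n".join(lines)

# Example: two confidence-weighted keywords routed to their regions.
print(build_prompt(map_keywords([("Cardiomegaly", 0.91), ("Effusion", 0.74)])))
```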