Purpose: As visual inspection is an inherent part of radiological screening, the associated eye gaze data can provide valuable insights into relevant clinical decisions. With deep learning now the state of the art for computer-assisted diagnosis, integrating human behavior, such as eye gaze data, into these systems can help align machine predictions with clinical diagnostic criteria, thus enhancing the quality of automatic radiological diagnosis. Methods: We propose a novel deep learning framework for joint disease diagnosis and prediction of the corresponding clinical visual attention maps for chest X-ray scans. Specifically, we introduce a new dual-encoder multi-task UNet, which leverages both a DenseNet201 backbone and a residual and squeeze-and-excitation (SE) block-based encoder to extract diverse features for visual attention map prediction, and a multi-scale feature-fusion classifier to perform disease classification. To tackle the issue of asynchronous training schedules across the individual tasks in multi-task learning, we propose a multi-stage cooperative learning strategy, with contrastive learning for feature-encoder pretraining to boost performance. Results: Our proposed method significantly outperforms existing techniques in both chest X-ray diagnosis (AUC = 0.93) and the quality of the predicted visual attention maps (correlation coefficient = 0.58). Conclusion: Through the proposed multi-task, multi-stage cooperative learning, our technique demonstrates the value of integrating clinicians' eye gaze into clinical AI systems to boost performance and, potentially, explainability.
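To make the dual-encoder multi-task design concrete, the following is a minimal, hypothetical PyTorch sketch: two encoders (one standing in for the DenseNet201 backbone, one built from a residual squeeze-and-excitation block) whose fused features feed both an attention-map decoder and a classification head. All layer sizes, strides, and the fusion scheme are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of a dual-encoder multi-task network; layer
# configurations are illustrative, not the paper's implementation.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: re-weight channels by global context."""
    def __init__(self, ch, r=8):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(),
                                nn.Linear(ch // r, ch), nn.Sigmoid())
    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))     # squeeze: global average pooling
        return x * w[:, :, None, None]      # excite: channel re-weighting

class ResSEBlock(nn.Module):
    """Residual block with a squeeze-and-excitation stage."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1))
        self.se = SEBlock(ch)
    def forward(self, x):
        return torch.relu(x + self.se(self.conv(x)))

class DualEncoderMultiTaskNet(nn.Module):
    """Two encoders feed an attention-map decoder and a disease classifier."""
    def __init__(self, n_classes=2):
        super().__init__()
        # Encoder A: simplified stand-in for the DenseNet201 backbone.
        self.enc_a = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        # Encoder B: residual + squeeze-and-excitation branch.
        self.enc_b = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=4, padding=1), nn.ReLU(),
            ResSEBlock(64))
        # Decoder head: predicts the visual attention map at input resolution.
        self.dec = nn.Sequential(nn.ConvTranspose2d(128, 32, 4, stride=4),
                                 nn.Conv2d(32, 1, 1))
        # Classification head on the fused, globally pooled features.
        self.cls = nn.Linear(128, n_classes)
    def forward(self, x):
        f = torch.cat([self.enc_a(x), self.enc_b(x)], dim=1)  # fuse encoders
        return self.dec(f), self.cls(f.mean(dim=(2, 3)))

net = DualEncoderMultiTaskNet()
amap, logits = net(torch.randn(2, 1, 64, 64))
print(amap.shape, logits.shape)
```

In this sketch the two heads share the fused encoder features, so one forward pass yields both the predicted visual attention map and the disease logits; in practice the two task losses would be balanced by the multi-stage cooperative training schedule described above.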