Deep learning has shown great potential in assisting radiologists in reading chest X-ray (CXR) images, but its reliance on expensive annotations to improve performance prevents widespread clinical application. Visual language pre-training (VLP) can alleviate the burden and cost of annotation by leveraging the reports routinely generated for radiographs, which exist in large quantities and are already paired with images (image-text pairs). Additionally, localization-aware extensions of VLP are being proposed to address the need for accurate localization of abnormalities in computer-aided diagnosis (CAD) for CXR. However, we find that the formulation proposed in the locality-aware VLP literature actually leads to a loss of the spatial relationships required for downstream localization tasks. Therefore, we propose Empowering Locality of VLP with Intra-modal Similarity (ELVIS), a VLP method aware of intra-modal locality that better preserves the locality within radiographs or reports, which enhances the ability to comprehend location references in text reports. Our locality-aware VLP method significantly outperforms state-of-the-art baselines in multiple segmentation tasks and the MS-CXR phrase grounding task. Qualitatively, ELVIS focuses more closely than prior approaches on the regions of interest described in the report text, allowing for enhanced interpretability.