Deep learning has shown great potential in assisting radiologists in reading chest X-ray (CXR) images, but its reliance on expensive annotations to improve performance prevents widespread clinical application. Visual language pre-training (VLP) can alleviate the burden and cost of annotation by leveraging the reports routinely generated for radiographs, which exist in large quantities and are already paired with images (image-text pairs). In addition, localization-aware extensions of VLP have been proposed to address the need for accurate localization of abnormalities in computer-aided diagnosis (CAD) for CXR. However, we find that the formulation proposed in the locality-aware VLP literature actually leads to a loss of the spatial relationships required for downstream localization tasks. We therefore propose Empowering Locality of VLP with Intra-modal Similarity (ELVIS), a VLP method aware of intra-modal locality that better preserves locality within radiographs or reports, which in turn enhances the ability to comprehend location references in text reports. Our locality-aware VLP method significantly outperforms state-of-the-art baselines on multiple segmentation tasks and on the MS-CXR phrase grounding task. Qualitatively, we show that ELVIS focuses more accurately than prior approaches on the regions of interest described in the report text, allowing for enhanced interpretability.