Medical vision-language pre-training has emerged as a promising approach for learning domain-general representations of medical images and text. Current algorithms that exploit global and local alignment between medical images and text may, however, be degraded by redundant information in medical data. To address this issue, we propose a grounded knowledge-enhanced medical vision-language pre-training (GK-MVLP) framework for chest X-rays. In this framework, medical knowledge is grounded to the appropriate anatomical regions by a transformer-based grounded knowledge-enhanced module, which performs fine-grained alignment between anatomical region-level visual features and the textual features of medical knowledge. The performance of GK-MVLP is competitive with or exceeds the state of the art on downstream chest X-ray disease classification, disease localization, report generation, and medical visual question-answering tasks. Our results show the advantage of incorporating a grounding mechanism to remove biases and improve the alignment between chest X-ray images and radiology reports.
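The region-to-knowledge alignment described in the abstract can be illustrated with a generic scaled dot-product cross-attention, in which each anatomical region feature attends over a set of knowledge text features. This is only a minimal sketch under stated assumptions, not the GK-MVLP module itself: the abstract does not specify the architecture, so the function names (`softmax`, `cross_attention`), the toy feature vectors, and the omission of learned projection matrices are all illustrative choices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(region_feats, knowledge_feats):
    """Scaled dot-product cross-attention (generic, hypothetical stand-in).

    Queries are region-level visual features; keys and values are
    knowledge text features. No learned projections are applied,
    which is a simplifying assumption of this sketch.
    Returns (attended_features, attention_weights).
    """
    dim = len(region_feats[0])
    scale = math.sqrt(dim)
    attended, weights = [], []
    for query in region_feats:
        # Similarity of this region to every knowledge entry.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in knowledge_feats
        ]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of knowledge features for this region.
        attended.append([
            sum(w * value[j] for w, value in zip(row, knowledge_feats))
            for j in range(dim)
        ])
    return attended, weights


# Toy data (made up for illustration): two regions, three knowledge entries.
region_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
knowledge_feats = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]

attended, weights = cross_attention(region_feats, knowledge_feats)
```

In this toy setup each region places its largest attention weight on the knowledge entry pointing in the same direction, which is the intuition behind grounding a piece of medical knowledge to the anatomical region it describes.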