In this paper, we consider enhancing medical visual-language pre-training (VLP) with domain-specific knowledge by exploiting the paired image-text reports produced in daily radiological practice. In particular, we make the following contributions: First, unlike existing works that directly process the raw reports, we adopt a novel triplet extraction module to extract the medically relevant information, avoiding unnecessary complexity from language grammar and enhancing the supervision signals; Second, we propose a novel triplet encoding module with entity translation by querying a knowledge base, to exploit the rich domain knowledge in the medical field and implicitly build relationships between medical entities in the language embedding space; Third, we propose to use a Transformer-based fusion model for spatially aligning the entity descriptions with visual signals at the image patch level, which enables medical diagnosis; Fourth, we conduct thorough experiments to validate the effectiveness of our architecture on numerous public benchmarks, e.g., ChestX-ray14, RSNA Pneumonia, SIIM-ACR Pneumothorax, COVIDx CXR-2, COVID Rural, and EdemaSeverity. In both zero-shot and fine-tuning settings, our model demonstrates strong performance compared with prior methods on disease classification and grounding.
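To make the patch-level alignment idea concrete, the following is a minimal sketch and not the fusion model itself. It assumes a single entity-description embedding acts as a query over a grid of image patch embeddings and uses plain single-head scaled dot-product attention, the basic operation inside a Transformer. The function names, the 2x2 grid, and the 3-dimensional toy embeddings are all hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entity_patch_attention(entity_vec, patch_vecs):
    """Scaled dot-product attention of one entity query over image patches.

    Returns one weight per patch. The weights sum to 1 and can be read as a
    coarse grounding map for the entity (an illustrative assumption).
    """
    scale = math.sqrt(len(entity_vec))
    scores = [
        sum(q * k for q, k in zip(entity_vec, patch)) / scale
        for patch in patch_vecs
    ]
    return softmax(scores)


def ground_entity(entity_vec, patch_vecs, grid_width):
    """Return (row, col) of the patch with the highest attention weight."""
    weights = entity_patch_attention(entity_vec, patch_vecs)
    best = max(range(len(weights)), key=weights.__getitem__)
    return divmod(best, grid_width)


# Toy 2x2 patch grid, flattened row by row (hypothetical embeddings).
patches = [
    [0.1, 0.0, 0.2],
    [0.9, 0.8, 0.1],
    [0.0, 0.1, 0.0],
    [0.2, 0.1, 0.3],
]
# Hypothetical embedding of one entity description.
entity = [1.0, 1.0, 0.0]

weights = entity_patch_attention(entity, patches)
location = ground_entity(entity, patches, grid_width=2)
```

In this toy setup, the patch whose embedding is most similar to the entity query receives the largest weight, so the attention map doubles as a rough localization signal. A full model would use learned projections, multiple heads, and stacked layers.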