In the field of medical Vision-Language Pre-training (VLP), significant effort has been devoted to deriving text and image features from clinical reports and their associated medical images. However, most existing methods overlook the opportunity to leverage the inherent hierarchical structure of clinical reports, which are generally split into `findings' for descriptive content and `impressions' for conclusive observations. Instead of exploiting this rich, structured format, current medical VLP approaches often simplify the report into either a unified entity or fragmented tokens. In this work, we propose IMITATE, a novel clinical-prior-guided VLP framework that learns the structural information of medical reports through hierarchical vision-language alignment. The framework derives multi-level visual features from chest X-ray (CXR) images and aligns these features separately with the descriptive and the conclusive text encoded in the hierarchical medical report. Furthermore, a new clinical-informed contrastive loss is introduced for cross-modal learning, which incorporates clinical prior knowledge when formulating sample correlations in contrastive learning. IMITATE outperforms baseline VLP methods across six datasets spanning five medical imaging downstream tasks. Comprehensive experimental results highlight the advantages of integrating the hierarchical structure of medical reports into vision-language alignment. The code related to this paper is available at https://github.com/cheliu-computation/IMITATE-TMI2024.
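To make the idea of a clinical-informed contrastive loss concrete, the following is a minimal sketch (not the paper's exact formulation). It assumes a hypothetical prior-similarity matrix `prior_sim` derived from clinical knowledge: rather than treating only the paired report as a positive with one-hot targets (as in standard InfoNCE/CLIP-style losses), the targets are softened by `prior_sim`, so reports describing clinically similar findings are not pushed apart as hard negatives. All function and variable names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def clinical_contrastive_loss(img_emb, txt_emb, prior_sim, temperature=0.07):
    """Cross-modal contrastive loss with soft targets (illustrative sketch).

    img_emb:   (B, D) image embeddings, e.g. from a CXR encoder
    txt_emb:   (B, D) text embeddings, e.g. from a report encoder
    prior_sim: (B, B) assumed clinical-prior similarity between reports;
               rows are turned into soft target distributions instead of
               the usual one-hot identity labels.
    """
    # L2-normalise both modalities before computing cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (B, B) image-to-text logits
    targets = softmax(prior_sim / temperature)   # soft labels from clinical prior
    # Soft-label cross-entropy over each image's row of text logits
    log_probs = logits - np.log(np.exp(logits - logits.max(axis=1, keepdims=True))
                                .sum(axis=1, keepdims=True)) - logits.max(axis=1, keepdims=True)
    return float(-(targets * log_probs).sum(axis=1).mean())
```

With `prior_sim = np.eye(B)` this reduces to an ordinary one-hot contrastive objective; replacing the identity with a graded similarity matrix is what injects the clinical prior into the sample correlations.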