In medical Vision-Language Pre-training (VLP), significant effort has been devoted to deriving text and image features from clinical reports and their associated medical images. However, most existing methods overlook the opportunity to leverage the inherent hierarchical structure of clinical reports, which are generally split into "findings" for descriptive content and "impressions" for conclusive observations. Instead of exploiting this rich, structured format, current medical VLP approaches often reduce the report to either a single unified entity or a set of fragmented tokens. In this work, we propose IMITATE, a novel clinical-prior-guided VLP framework that learns the structural information in medical reports through hierarchical vision-language alignment. The framework derives multi-level visual features from chest X-ray (CXR) images and separately aligns these features with the descriptive and the conclusive text encoded in the hierarchical medical report. Furthermore, we introduce a new clinically informed contrastive loss for cross-modal learning, which accounts for clinical prior knowledge when formulating sample correlations in contrastive learning. IMITATE outperforms baseline VLP methods across six different datasets spanning five medical imaging downstream tasks. Comprehensive experimental results highlight the advantages of integrating the hierarchical structure of medical reports into vision-language alignment.
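The abstract does not give the form of the loss, so the following is only a minimal sketch of one way a contrastive objective could account for prior sample correlations: the usual one-hot targets are replaced by soft targets computed from a prior similarity matrix between reports. Everything here is an assumption made for illustration, not the IMITATE implementation: the function names, the `prior_sim` matrix, both temperature values, and the pairing of mid-level features with findings and high-level features with impressions.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(z):
    """Numerically stable row-wise log-softmax."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


def soft_contrastive_loss(img, txt, prior_sim,
                          temperature=0.07, prior_temperature=0.1):
    """Symmetric contrastive loss with soft targets (hypothetical form).

    img, txt:   (batch, dim) embeddings of paired images and report sections.
    prior_sim:  (batch, batch) assumed prior similarity between samples;
                an identity matrix gives near one-hot targets, i.e. roughly
                the standard InfoNCE objective.
    """
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / temperature

    # Soft targets: row-wise softmax over the prior similarities.
    targets_i2t = np.exp(log_softmax(prior_sim / prior_temperature))
    targets_t2i = np.exp(log_softmax(prior_sim.T / prior_temperature))

    # Cross-entropy between soft targets and predicted distributions.
    loss_i2t = -(targets_i2t * log_softmax(logits)).sum(axis=1).mean()
    loss_t2i = -(targets_t2i * log_softmax(logits.T)).sum(axis=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)


def hierarchical_loss(mid_feats, high_feats, findings, impressions, prior_sim):
    """Sum of two alignment terms, one per report section.

    The level-to-section pairing is an assumption for this sketch.
    """
    return (soft_contrastive_loss(mid_feats, findings, prior_sim)
            + soft_contrastive_loss(high_feats, impressions, prior_sim))
```

With `prior_sim` set to the identity matrix, the soft targets are almost one-hot and the loss behaves like a standard symmetric contrastive loss; off-diagonal entries in `prior_sim` reduce the penalty for pairing an image with the report of a clinically similar sample.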