In the field of medical Vision-Language Pre-training (VLP), significant efforts have been devoted to deriving text and image features from clinical reports and their associated medical images. However, most existing methods overlook the opportunity to leverage the inherent hierarchical structure of clinical reports, which are generally split into `findings' for descriptive content and `impressions' for conclusive observations. Instead of exploiting this rich, structured format, current medical VLP approaches often simplify the report into either a unified entity or fragmented tokens. In this work, we propose a novel clinical-prior-guided VLP framework, named IMITATE, that learns the structural information of medical reports through hierarchical vision-language alignment. The framework derives multi-level visual features from chest X-ray (CXR) images and separately aligns these features with the descriptive and conclusive text encoded in the hierarchical medical report. Furthermore, a new clinical-informed contrastive loss is introduced for cross-modal learning, which incorporates clinical prior knowledge when formulating sample correlations in contrastive learning. The proposed model, IMITATE, outperforms baseline VLP methods across six different datasets spanning five medical imaging downstream tasks. Comprehensive experimental results highlight the advantages of integrating the hierarchical structure of medical reports into vision-language alignment.
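To make the clinical-informed contrastive loss concrete, the following is a minimal NumPy sketch of one plausible formulation. It is an illustration, not the paper's exact method: the function name `clinical_contrastive_loss`, the use of softened targets derived from a precomputed clinical prior similarity matrix `prior_sim`, and the temperature value are all assumptions made for this example. The key idea it demonstrates is replacing the standard one-hot InfoNCE targets with a soft distribution, so that clinically similar image-report pairs are not treated as hard negatives.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def clinical_contrastive_loss(img_emb, txt_emb, prior_sim, temperature=0.07):
    """Hypothetical contrastive loss with clinical-prior soft targets.

    img_emb:   (N, D) image embeddings
    txt_emb:   (N, D) report (text) embeddings
    prior_sim: (N, N) clinical similarity between samples (assumed given,
               e.g. from label co-occurrence); a scaled identity matrix
               recovers standard one-hot InfoNCE targets.
    """
    # L2-normalize both modalities so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (N, N) cross-modal logits
    targets = softmax(prior_sim / temperature)   # soft labels from the prior
    log_probs = np.log(softmax(logits))
    # Cross-entropy between soft clinical targets and predicted matches.
    return float(-(targets * log_probs).sum(axis=1).mean())

# Usage: a strongly diagonal prior approximates the usual InfoNCE objective.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 16))
txt = img + 0.01 * rng.normal(size=(4, 16))  # nearly matched pairs
loss = clinical_contrastive_loss(img, txt, prior_sim=np.eye(4) * 10.0)
```

Under this formulation, off-diagonal mass in `prior_sim` down-weights the penalty for matching an image to a clinically similar but non-paired report, which is one way to encode the sample-correlation idea described above.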