In recent years, the growing demand for medical imaging diagnosis has placed a significant burden on radiologists. Existing medical vision-language pre-training (Med-VLP) methods offer a route to automated medical image analysis: they learn universal representations from large-scale collections of medical images and reports, and these representations benefit downstream tasks without requiring fine-grained annotations. However, existing methods based on joint image-text reconstruction neglect the importance of combining cross-modal alignment with joint reconstruction, which results in inadequate cross-modal interaction. In this paper, we propose a unified Med-VLP framework based on Multi-task Paired Masking with Alignment (MPMA), which integrates the cross-modal alignment task into the joint image-text reconstruction framework to achieve more comprehensive cross-modal interaction. Within this framework, a global and local alignment (GLA) module is designed to help the self-supervised paradigm obtain semantic representations with rich domain knowledge. To achieve more comprehensive cross-modal fusion, we also propose a Memory-Augmented Cross-Modal Fusion (MA-CMF) module that fully integrates visual features to assist report reconstruction. Experimental results show that our approach outperforms previous methods on all downstream tasks, including uni-modal, cross-modal, and multi-modal tasks.