In recent years, the growing demand for medical imaging diagnosis has placed a significant burden on radiologists. As a solution, Medical Vision-Language Pre-training (Med-VLP) methods have been proposed to learn universal representations from medical images and reports, benefiting downstream tasks without requiring fine-grained annotations. However, existing methods overlook the importance of cross-modal alignment in joint image-text reconstruction, which results in insufficient cross-modal interaction. To address this limitation, we propose a unified Med-VLP framework based on Multi-task Paired Masking with Alignment (MPMA), which integrates the cross-modal alignment task into the joint image-text reconstruction framework to achieve more comprehensive cross-modal interaction. In addition, a Global and Local Alignment (GLA) module is designed to help the self-supervised paradigm obtain semantic representations rich in domain knowledge. Furthermore, we introduce a Memory-Augmented Cross-Modal Fusion (MA-CMF) module that fully integrates visual information to assist report reconstruction and to fuse the multi-modal representations adequately. Experimental results demonstrate that the proposed unified approach outperforms previous methods on all downstream tasks, including uni-modal, cross-modal, and multi-modal tasks.