Multimodal pre-training, which learns medical visual representations from paired medical reports, has shown great potential in the medical domain. However, many pre-training tasks require extra annotations from clinicians, and most fail to explicitly guide the model toward the desired features of different pathologies. In this paper, we use Visual Question Answering (VQA) as a multimodal pre-training task to focus the framework on targeted pathological features. We leverage the descriptions in medical reports to design multi-granular question-answer pairs associated with different diseases, which support pre-training without requiring extra annotations from experts. We also propose a novel pre-training framework with a quasi-textual feature transformer, a module that transforms visual features into a quasi-textual space closer to the textual domain via a contrastive learning strategy. This narrows the vision-language gap and facilitates modality alignment. We apply our framework to four downstream tasks (report generation, classification, segmentation, and detection) across five datasets. Extensive experiments demonstrate the superiority of our framework over other state-of-the-art methods. Our code is available at https://github.com/MoramiSu/QFT-MICCAI2024.
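To make the core idea concrete, below is a minimal PyTorch sketch of what a quasi-textual feature transformer paired with a contrastive alignment objective could look like: learnable query tokens cross-attend to visual features, and the resulting quasi-textual features are pulled toward paired report embeddings with a symmetric InfoNCE loss. All module names, the architecture, and hyperparameters here are illustrative assumptions, not the authors' actual implementation; see the linked repository for that.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuasiTextualFeatureTransformer(nn.Module):
    """Hypothetical sketch: maps visual tokens into a quasi-textual space."""
    def __init__(self, dim=768, num_queries=32, num_heads=8):
        super().__init__()
        # Learnable queries that absorb pathology-relevant visual content.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, visual_feats):  # visual_feats: (B, N, dim)
        q = self.queries.expand(visual_feats.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, visual_feats, visual_feats)
        x = self.norm1(q + attn_out)
        x = self.norm2(x + self.ffn(x))
        return x.mean(dim=1)  # (B, dim) pooled quasi-textual feature

def contrastive_loss(quasi_text, text_emb, temperature=0.07):
    """Symmetric InfoNCE between quasi-textual and report (text) embeddings."""
    v = F.normalize(quasi_text, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```

The query-token bottleneck is one plausible design choice for such a module: it yields a fixed-size set of features shaped by the contrastive objective toward the text embedding distribution, which is what makes subsequent vision-language alignment easier.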