Medical visual question answering (VQA) is a challenging task that requires answering clinical questions about a given medical image by considering both visual and language information. Because training data for medical VQA is scarce, the pre-training and fine-tuning paradigm has become a common way to improve model generalization. In this paper, we present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text from medical image caption datasets. It uses unimodal and multimodal contrastive losses, together with masked language modeling and image-text matching, as pre-training objectives. The pre-trained model is then transferred to downstream medical VQA tasks. The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets, with significant accuracy improvements of 2.2%, 14.7%, and 1.7%, respectively. In addition, we conduct a comprehensive analysis to validate the effectiveness of the individual components of the approach and to study different pre-training settings. Our code and models are available at https://github.com/pengfeiliHEU/MUMC.
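To make the multimodal contrastive objective mentioned in the abstract more concrete, the following is a minimal sketch of a symmetric image-text contrastive (InfoNCE-style) loss in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, and the toy feature vectors are all hypothetical, and the abstract does not specify the exact form of the loss.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss (assumed form, hypothetical helper).

    image_feats[i] and text_feats[i] are treated as the positive pair;
    every other pairing in the batch serves as a negative.
    """
    imgs = [normalize(v) for v in image_feats]
    txts = [normalize(v) for v in text_feats]
    n = len(imgs)
    # sim[i][j]: cosine similarity of image i and text j, divided by temperature
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image view
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    t2i = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return (i2t + t2i) / 2


# Toy batch: matched pairs should give a lower loss than shuffled pairs.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each image is paired with its own caption and large when the captions are swapped, which is the behavior a contrastive pre-training objective is meant to encourage. In practice the features would come from image and text encoders rather than hand-written vectors.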