Medical visual question answering (VQA) is a challenging multimodal task in which Vision-Language Pre-training (VLP) models can effectively improve generalization. However, most methods in the medical field treat VQA as an answer classification task, which makes them difficult to transfer to practical application scenarios. Additionally, because medical images are privacy-sensitive and expensive to annotate, large-scale datasets of medical image-text pairs for pre-training are severely lacking. In this paper, we propose a large-scale MultI-task Self-Supervised learning based framework (MISS) for medical VQA tasks. Unlike existing methods, we treat medical VQA as a generative task. We unify the text encoder and the multimodal encoder and align image-text features through multi-task learning. Furthermore, we propose a Transfer-and-Caption method that uses large language models (LLMs) to extend the feature space of single-modal image datasets, so that data from traditional medical vision tasks can be used for VLP. Experiments show that our method achieves excellent results with fewer multimodal datasets and demonstrates the advantages of generative VQA models. The code and model weights will be released upon the paper's acceptance.