Medical visual question answering (VQA) is a challenging multimodal task in which Vision-Language Pre-training (VLP) models can effectively improve generalization performance. However, most methods in the medical field treat VQA as an answer classification task, which makes them difficult to transfer to practical application scenarios. Additionally, because medical images are privacy-sensitive and annotation is expensive, large-scale medical image-text pair datasets for pretraining are severely lacking. In this paper, we propose a large-scale MultI-task Self-Supervised learning based framework (MISS) for medical VQA tasks. Unlike existing methods, we treat medical VQA as a generative task. We unify the text encoder and the multimodal encoder and align image-text features through multi-task learning. Furthermore, we propose a Transfer-and-Caption method that uses large language models (LLMs) to extend the feature space of single-modal image datasets, enabling data from traditional medical vision tasks to be applied to VLP. Experiments show that our method achieves excellent results with fewer multimodal datasets and demonstrates the advantages of generative VQA models. The code and model weights will be released upon the paper's acceptance.