In this paper, we focus on the problem of Medical Visual Question Answering (MedVQA), which is crucial for efficiently interpreting medical images that carry vital clinically relevant information. First, we reframe MedVQA as a generation task that naturally follows human-machine interaction, and we propose a generative model for medical visual understanding that aligns visual information from a pre-trained vision encoder with a large language model. Second, we establish a scalable pipeline to construct a large-scale medical visual question-answering dataset, named PMC-VQA, which contains 227k VQA pairs over 149k images covering various modalities and diseases. Third, we pre-train our proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD and SLAKE, outperforming existing work by a large margin. Additionally, we propose a manually verified test set that is significantly more challenging: even the best models struggle to solve it.