In this paper, we focus on the problem of Medical Visual Question Answering (MedVQA), which is crucial for efficiently interpreting medical images that carry vital, clinically relevant information. First, we reframe MedVQA as a generation task that naturally follows human-machine interaction, and we propose a generative model for medical visual understanding that aligns visual information from a pre-trained vision encoder with a large language model. Second, we establish a scalable pipeline to construct a large-scale medical visual question-answering dataset, named PMC-VQA, which contains 227k VQA pairs for 149k images covering various modalities and diseases. Third, we pre-train our proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD and SLAKE, outperforming existing work by a large margin. Additionally, we propose a manually verified test set that is significantly more challenging: even the best models struggle to solve it.