In this paper, we focus on the problem of Medical Visual Question Answering (MedVQA), which is crucial for efficiently interpreting medical images that carry vital clinically relevant information. First, we reframe MedVQA as a generation task that naturally follows human-machine interaction, and we propose a generative model for medical visual understanding that aligns visual information from a pre-trained vision encoder with a large language model. Second, we establish a scalable pipeline to construct a large-scale medical visual question-answering dataset, named PMC-VQA, which contains 227k VQA pairs over 149k images that cover various modalities or diseases. Third, we pre-train our proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD and SLAKE, outperforming existing work by a large margin. Additionally, we propose a manually verified test set that is significantly more challenging: even the best models struggle to solve it.