Medical Visual Question Answering (MedVQA), which provides natural-language answers to image-based medical questions, is a challenging task and a significant advance for healthcare. It helps medical experts interpret medical images swiftly, enabling faster and more accurate diagnoses. However, existing MedVQA solutions often offer limited interpretability and transparency, which makes their decision-making processes difficult to understand. To address this issue, we devise a semi-automated annotation process that streamlines data preparation, and we build two new benchmark MedVQA datasets, R-RAD and R-SLAKE. These datasets provide intermediate medical decision-making rationales, generated by multimodal large language models and human annotation, for question-answer pairs in the existing MedVQA datasets VQA-RAD and SLAKE. Moreover, we design a novel framework that fine-tunes lightweight pretrained generative models by incorporating medical decision-making rationales into the training process. The framework includes three distinct strategies for generating decision outcomes and their corresponding rationales, thereby clearly exposing the medical decision-making process during reasoning. Extensive experiments demonstrate that our method achieves an accuracy of 83.5% on R-RAD and 86.3% on R-SLAKE, significantly outperforming existing state-of-the-art baselines. The dataset and code will be released.