Medical visual question answering (Med-VQA) aims to automatically predict correct answers to questions about medical images, helping physicians cut down on repetitive tasks and easing their workload. Existing approaches mainly pre-train models on additional, comprehensive datasets and then fine-tune them to improve performance on downstream tasks. However, there is also significant value in exploring existing models to extract clinically relevant information. In this paper, we propose the Latent Prompt Assist model (LaPA) for medical visual question answering. First, we design a latent prompt generation module that generates the latent prompt under the constraint of the target answer. Next, we propose a multi-modal fusion block with a latent prompt fusion module that uses the latent prompt to extract clinically relevant information from uni-modal and multi-modal features. We also introduce a prior knowledge fusion module that integrates the relationship between diseases and organs with the clinically relevant information. Finally, we combine the integrated information with image-language cross-modal information to predict the final answers. Experimental results on three publicly available Med-VQA datasets show that LaPA outperforms the state-of-the-art model ARL, with improvements of 1.83%, 0.63%, and 1.80% on VQA-RAD, SLAKE, and VQA-2019, respectively. The code is publicly available at https://github.com/GaryGuTC/LaPA_model.
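To make the latent prompt fusion step more concrete, the following is a minimal sketch of one plausible reading of it: a small set of latent prompt tokens attends, in turn, to image, text, and multi-modal features and accumulates what it retrieves. This is not the authors' implementation (see the repository linked above for that). The function names, the use of plain single-head dot-product attention without learned projections, the residual update, and the order of fusion are all assumptions made for illustration. The answer constraint on prompt generation and the prior knowledge fusion module are omitted.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head scaled dot-product attention.

    Assumption: no learned query/key/value projections; the same
    feature tokens serve as both keys and values.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        out.append([
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ])
    return out


def fuse_latent_prompt(latent_prompt, image_feats, text_feats, multimodal_feats):
    """Hypothetical fusion: the prompt attends to each feature set in turn.

    Each step adds the attended features back onto the prompt tokens
    (a residual update), so the prompt accumulates information from
    uni-modal and multi-modal features.
    """
    fused = latent_prompt
    for feats in (image_feats, text_feats, multimodal_feats):
        attended = cross_attend(fused, feats)
        fused = [
            [p + a for p, a in zip(p_row, a_row)]
            for p_row, a_row in zip(fused, attended)
        ]
    return fused


def rand_tokens(n, dim, rng):
    """Random stand-in features; real features would come from encoders."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    prompt = rand_tokens(4, 8, rng)       # 4 latent prompt tokens
    image = rand_tokens(16, 8, rng)       # 16 image tokens
    text = rand_tokens(10, 8, rng)        # 10 question tokens
    multimodal = rand_tokens(26, 8, rng)  # 26 fused tokens
    fused = fuse_latent_prompt(prompt, image, text, multimodal)
    print(len(fused), len(fused[0]))
```

In this sketch the output keeps the shape of the latent prompt (one vector per prompt token), which makes it easy to combine with other fixed-size representations before an answer classifier. A trained model would use learned projections and a learnable prompt in place of the random stand-ins.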