Medical visual question answering (VQA) aims to answer clinically relevant questions about input medical images. This technique has the potential to improve the efficiency of medical professionals while relieving the burden on public health systems, particularly in resource-poor countries. Existing medical VQA methods tend to encode medical images and learn the correspondence between visual features and questions without exploiting the spatial, semantic, or medical knowledge behind them. This is partly because current medical VQA datasets are small and often contain only simple questions. Therefore, we first collect a comprehensive, large-scale medical VQA dataset focusing on chest X-ray images. The questions in our dataset involve detailed relationships, such as disease names, locations, severity levels, and types. Based on this dataset, we also propose a novel baseline method that constructs three relationship graphs over the image regions, questions, and semantic labels: a spatial relationship graph, a semantic relationship graph, and an implicit relationship graph. The answer and the graph reasoning path are learned for different questions.
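To make the three relationship graphs more concrete, the following is a minimal sketch of how such graphs might be built over detected image regions. It is not the method proposed here: the region format, the geometric rules in `spatial_relation`, the `related_labels` lookup, and the uniform implicit weights are all illustrative assumptions, and the names `spatial_relation` and `build_graphs` are hypothetical.

```python
from itertools import permutations


def spatial_relation(a, b):
    """Coarse spatial label for box a relative to box b.

    Boxes are (x1, y1, x2, y2) in image coordinates (y grows downward).
    The rule set is an assumption made for illustration only.
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    if ax1 >= bx1 and ay1 >= by1 and ax2 <= bx2 and ay2 <= by2:
        return "inside"
    if ax1 <= bx1 and ay1 <= by1 and ax2 >= bx2 and ay2 >= by2:
        return "contains"
    if ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2:
        return "overlaps"
    # Disjoint boxes: label by the dominant direction between box centers.
    dx = (ax1 + ax2) / 2 - (bx1 + bx2) / 2
    dy = (ay1 + ay2) / 2 - (by1 + by2) / 2
    if abs(dx) >= abs(dy):
        return "left_of" if dx < 0 else "right_of"
    return "above" if dy < 0 else "below"


def build_graphs(regions, related_labels):
    """Build three edge dictionaries keyed by ordered region-index pairs.

    regions: list of (label, box) pairs (hypothetical format).
    related_labels: set of frozenset label pairs assumed to be semantically
        related; a stand-in for real medical knowledge.
    """
    spatial, semantic, implicit = {}, {}, {}
    n = len(regions)
    for i, j in permutations(range(n), 2):
        label_i, box_i = regions[i]
        label_j, box_j = regions[j]
        # Spatial graph: every ordered pair gets a geometric relation label.
        spatial[(i, j)] = spatial_relation(box_i, box_j)
        # Semantic graph: an edge only where the labels are known to relate.
        if frozenset((label_i, label_j)) in related_labels:
            semantic[(i, j)] = "related"
        # Implicit graph: fully connected; uniform weights are a placeholder
        # for weights that a model would learn.
        implicit[(i, j)] = 1.0 / (n - 1)
    return spatial, semantic, implicit


regions = [
    ("lung", (0, 0, 100, 200)),
    ("nodule", (20, 30, 40, 50)),
    ("heart", (90, 80, 160, 180)),
]
related = {frozenset(("lung", "nodule"))}
spatial, semantic, implicit = build_graphs(regions, related)
print(spatial[(1, 0)], sorted(semantic), implicit[(0, 2)])
```

In this sketch the spatial and semantic graphs carry discrete edge labels, while the implicit graph carries only weights, which leaves room for a model to learn which region pairs matter for a given question.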