Visual Question Answering (VQA) models aim to answer natural language questions about given images. Because it allows questions to be asked that differ from those seen during training, medical VQA has received substantial attention in recent years. However, existing medical VQA models typically focus on questions that refer to an entire image rather than to the region of the image where the relevant content is located. Consequently, these models offer limited interpretability and cannot readily be probed about specific image regions. This paper proposes a novel approach for medical VQA that addresses this limitation with a model that can answer questions about image regions while taking into account the context needed to answer them. Our experimental results demonstrate the effectiveness of the proposed model, which outperforms existing methods on three datasets. Our code and data are available at https://github.com/sergiotasconmorales/locvqa.