Q2ATransformer: Improving Medical VQA via an Answer Querying Decoder

Medical Visual Question Answering (VQA) systems play a supporting role in understanding the clinically relevant information carried by medical images. Questions about a medical image fall into two categories: closed-ended (such as Yes/No questions) and open-ended. To obtain answers, the majority of existing medical VQA methods rely on classification approaches, while a few works use generation approaches or a mixture of the two. Classification approaches are relatively simple but perform poorly on long open-ended questions. To bridge this gap, we propose a new Transformer-based framework for medical VQA, named Q2ATransformer, which integrates the advantages of both classification and generation approaches and provides a unified treatment of closed-ended and open-ended questions. Specifically, we introduce an additional Transformer decoder with a set of learnable candidate answer embeddings that query the existence of each answer class for a given image-question pair. Through Transformer attention, the candidate answer embeddings interact with the fused features of the image-question pair to make the decision. In this way, despite being a classification-based approach, our method provides a mechanism to interact with the answer information during prediction, as generation-based approaches do. At the same time, by using classification, we reduce the difficulty of the task by shrinking the search space of answers. Our method achieves new state-of-the-art performance on two medical VQA benchmarks. In particular, on open-ended questions, we achieve 79.19% accuracy on VQA-RAD and 54.85% on PathVQA, absolute improvements of 16.09% and 41.45%, respectively.
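
The answer-querying idea described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: it assumes a single attention head with no learned query/key/value projections, a hypothetical per-answer linear head followed by a sigmoid, and random vectors standing in for the learned answer embeddings and the fused image-question features. The names `answer_query_scores`, `out_weights`, and `out_bias` are introduced here for illustration only.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def answer_query_scores(answer_embeddings, fused_features, out_weights, out_bias):
    """Score each candidate answer against fused image-question features.

    answer_embeddings: K vectors of size d (learnable in a real model).
    fused_features:    N vectors of size d (fused image-question tokens).
    out_weights/out_bias: hypothetical per-answer linear head (an assumption).
    Returns K probabilities, one per candidate answer class.
    """
    d = len(fused_features[0])
    scale = math.sqrt(d)
    scores = []
    for query, w, b in zip(answer_embeddings, out_weights, out_bias):
        # Cross-attention: the answer embedding is the query,
        # the fused tokens serve as both keys and values.
        attn = softmax([dot(query, tok) / scale for tok in fused_features])
        attended = [
            sum(a * tok[j] for a, tok in zip(attn, fused_features))
            for j in range(d)
        ]
        # Binary decision: does this answer class apply to the pair?
        scores.append(sigmoid(dot(attended, w) + b))
    return scores


# Toy inputs: random stand-ins for learned embeddings and fused features.
rng = random.Random(0)
d, num_tokens, num_answers = 8, 5, 4


def rand_vec():
    return [rng.uniform(-1, 1) for _ in range(d)]


fused = [rand_vec() for _ in range(num_tokens)]
answers = [rand_vec() for _ in range(num_answers)]
weights = [rand_vec() for _ in range(num_answers)]
bias = [0.0] * num_answers

probs = answer_query_scores(answers, fused, weights, bias)
predicted = max(range(num_answers), key=probs.__getitem__)
print(predicted, [round(p, 3) for p in probs])
```

Because each candidate answer produces its own score from its own attention pattern, the same mechanism covers closed-ended questions (a small set of classes such as Yes/No) and open-ended ones (a larger answer vocabulary), while the output space stays a fixed set of answer classes rather than free-form text.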
