Visual Question Answering (VQA) in the medical domain is an interdisciplinary challenge that combines Computer Vision, Natural Language Processing, and Knowledge Representation. Despite its importance, research on medical VQA has been scarce and has gained momentum only since 2018. To address this gap, we investigate how to represent radiology images effectively and how to learn joint multimodal representations, improving on existing methods. We also augment the SLAKE dataset so that our model can answer a more diverse range of questions, including ones that go beyond the immediate content of radiology or pathology images. With a less complex architecture, our model achieves a top-1 accuracy of 79.55\%, comparable to current state-of-the-art models. This work advances medical VQA and opens avenues for practical applications in diagnostic settings.