Medical Visual Question Answering (VQA) is an important challenge, as it could lead to faster and more accurate diagnoses and treatment decisions. Most existing methods approach it as a multi-class classification problem, which restricts the outcome to a predefined closed set of curated answers. We focus on open-ended VQA and, motivated by recent advances in language models, treat it as a generative task. Leveraging pre-trained language models, we introduce a novel method particularly suited to small, domain-specific medical datasets. To properly communicate the medical images to the language model, we develop a network that maps the extracted visual features to a set of learnable tokens. Alongside the question, these learnable tokens then directly prompt the language model. We explore recent parameter-efficient fine-tuning strategies for language models, which allow for resource- and data-efficient fine-tuning. We evaluate our approach on the main medical VQA benchmarks, namely Slake, OVQA, and PathVQA. The results demonstrate that our approach outperforms existing methods across various training settings while also being computationally efficient.
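The mapping step described above can be illustrated with a minimal, framework-free sketch. It is not the architecture used in this work: the single linear layer, the dimensions, the random stand-in features, and the ordering of visual tokens before the question are all assumptions made for illustration, and every function name is hypothetical. A practical implementation would use a deep learning framework and a pre-trained visual encoder and language model.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Create a randomly initialised weight matrix (out_dim x in_dim) and a zero bias."""
    scale = 1.0 / in_dim ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, weights, bias):
    """Apply y = W x + b to a single vector."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def map_to_prefix(visual_features, weights, bias, num_tokens, embed_dim):
    """Project one visual feature vector and split it into num_tokens embeddings."""
    flat = linear(visual_features, weights, bias)
    return [flat[i * embed_dim:(i + 1) * embed_dim] for i in range(num_tokens)]


def build_prompt(prefix_tokens, question_embeddings):
    """Assumed ordering: visual prefix tokens first, then the question embeddings."""
    return prefix_tokens + question_embeddings


# Hypothetical sizes, chosen small so the example runs instantly.
VISUAL_DIM, EMBED_DIM, NUM_PREFIX_TOKENS, QUESTION_LEN = 8, 4, 3, 5

rng = random.Random(0)
weights, bias = make_linear(VISUAL_DIM, NUM_PREFIX_TOKENS * EMBED_DIM, rng)

# Stand-ins for the visual encoder output and the embedded question tokens.
visual_features = [rng.random() for _ in range(VISUAL_DIM)]
question_embeddings = [[rng.random() for _ in range(EMBED_DIM)]
                       for _ in range(QUESTION_LEN)]

prefix = map_to_prefix(visual_features, weights, bias,
                       NUM_PREFIX_TOKENS, EMBED_DIM)
prompt = build_prompt(prefix, question_embeddings)

print(len(prompt), len(prompt[0]))  # prints: 8 4
```

In this sketch only the mapping weights would be trained, so the prompt sequence handed to the language model grows by a fixed number of visual tokens regardless of the image.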