Medical visual question answering (Med-VQA) is a machine learning task in which a system answers natural language questions about given medical images. Although general VQA has progressed rapidly, Med-VQA has advanced more slowly because large-scale annotated datasets are lacking. In this paper, we present domain-specific pre-training strategies, including a novel contrastive learning pre-training method, to mitigate the problem of small datasets for the Med-VQA task. We find that the model benefits from components that use fewer parameters. We also evaluate and discuss the model's visual reasoning using evidence verification techniques. Our proposed model achieved an accuracy of 60% on the VQA-Med 2019 test set, comparable to other state-of-the-art Med-VQA models.