Visual question answering (VQA) is the task of answering a series of questions about a given image. Building an effective VQA model requires a large amount of QA data, which is expensive to collect. Generating synthetic QA pairs from templates is a practical way to obtain such data; however, VQA models trained on these data perform poorly on complex, human-written questions. To address this issue, we propose a new method called {\it chain of QA for human-written questions} (CoQAH). CoQAH uses a sequence of QA interactions between a large language model and a VQA model trained on synthetic data to reason about and derive logical answers to human-written questions. We evaluated CoQAH on two types of human-written VQA datasets, covering 3D-rendered and chest X-ray images, and found that it achieved state-of-the-art accuracy on both. Notably, CoQAH outperformed general vision-language models, VQA models, and medical foundation models without any fine-tuning.
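To make the chain-of-QA idea concrete, the following is a minimal sketch of how such a loop could look, assuming hypothetical callables {\tt llm(prompt) -> str} (a large language model) and {\tt vqa(image, question) -> str} (a VQA model trained on synthetic, template-based QA pairs); it is an illustration of the general scheme described above, not the authors' exact implementation.

\begin{verbatim}
# Hedged sketch of a chain-of-QA loop for human-written questions.
# `llm` and `vqa` are hypothetical callables (assumptions, not a real API):
#   llm(prompt: str) -> str   wraps a large language model
#   vqa(image, question: str) -> str   wraps a synthetic-data VQA model

def coqah_answer(llm, vqa, image, human_question, max_turns=10):
    """Answer a complex, human-written question by chaining simple sub-questions."""
    history = []  # (sub_question, vqa_answer) pairs gathered so far
    for _ in range(max_turns):
        # Ask the LLM for the next simple, template-style sub-question,
        # or a signal that it can already conclude the final answer.
        prompt = (
            f"Target question: {human_question}\n"
            f"Sub-QA so far: {history}\n"
            "Reply with the next simple sub-question, or 'DONE: <answer>'."
        )
        reply = llm(prompt)
        if reply.startswith("DONE:"):
            return reply[len("DONE:"):].strip()
        # Let the synthetic-data VQA model answer the simple sub-question.
        history.append((reply, vqa(image, reply)))
    # Fallback: force the LLM to conclude from the collected sub-QA pairs.
    return llm(
        f"Target question: {human_question}\nSub-QA: {history}\n"
        "Give the final answer."
    )
\end{verbatim}

The key design point, as stated in the abstract, is that neither model is fine-tuned: the VQA model only ever sees simple questions of the kind it was trained on, while the language model handles the decomposition and final reasoning.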