Visual question answering (VQA) is a task in which an image is given and a series of questions is asked about it. Building an efficient VQA algorithm requires a large amount of QA data, which is very expensive to collect. Generating synthetic QA pairs from templates is a practical way to obtain such data. However, VQA models trained on synthetic data do not perform well on complex, human-written questions. To address this issue, we propose a new method called {\it chain of QA for human-written questions} (CoQAH). CoQAH uses a sequence of QA interactions between a large language model and a VQA model trained on synthetic data to reason about human-written questions and derive logical answers to them. We tested CoQAH on two types of human-written VQA datasets, one with 3D-rendered images and one with chest X-ray images, and found that it achieved state-of-the-art accuracy on both. Notably, without any finetuning, CoQAH outperformed general vision-language models, VQA models, and medical foundation models.
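The QA interaction between the language model and the VQA model can be pictured with a short Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `chain_of_qa`, `llm_step`, and `vqa_answer`, the "ask"/"answer" protocol, the turn limit, and the toy stubs standing in for a real LLM and a real VQA model are all hypothetical.

```python
from typing import Callable, List, Tuple

# Each entry is a (sub-question, VQA answer) pair from one interaction turn.
History = List[Tuple[str, str]]


def chain_of_qa(
    user_question: str,
    llm_step: Callable[[str, History], Tuple[str, str]],
    vqa_answer: Callable[[str], str],
    max_turns: int = 5,
) -> Tuple[str, History]:
    """Run a chain of QA turns (hypothetical protocol).

    llm_step returns ("ask", sub_question) to query the VQA model, or
    ("answer", final_answer) to stop. vqa_answer is assumed to be bound
    to one image already, so it only takes the question text.
    """
    history: History = []
    for _ in range(max_turns):
        action, text = llm_step(user_question, history)
        if action == "answer":
            return text, history
        # Forward the sub-question to the VQA model and record its reply.
        history.append((text, vqa_answer(text)))
    # Turn budget exhausted without a final answer (assumed fallback).
    return "unknown", history


def toy_vqa(question: str) -> str:
    """Stub VQA model: only understands two template-style questions."""
    facts = {
        "How many cubes are there?": "2",
        "How many spheres are there?": "3",
    }
    return facts.get(question, "unknown")


def toy_llm(user_question: str, history: History) -> Tuple[str, str]:
    """Stub LLM: asks two fixed sub-questions, then compares the counts."""
    plan = ["How many cubes are there?", "How many spheres are there?"]
    if len(history) < len(plan):
        return "ask", plan[len(history)]
    cubes, spheres = (int(answer) for _, answer in history)
    return "answer", "yes" if spheres > cubes else "no"


answer, trace = chain_of_qa(
    "Are there more spheres than cubes?", toy_llm, toy_vqa
)
print(answer)
for sub_question, reply in trace:
    print(sub_question, "->", reply)
```

In this sketch the human-written question never reaches the VQA model directly; the model only sees the simpler sub-questions, which is the division of labour the abstract describes.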