Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Early studies retrieve the required knowledge from explicit knowledge bases (KBs), which often introduces information irrelevant to the question and thereby limits model performance. Recent works instead use a large language model (i.e., GPT-3) as an implicit knowledge engine to acquire the knowledge needed for answering. Despite the encouraging results of these methods, we argue that they have not fully activated the capacity of GPT-3, because the input information they provide is insufficient. In this paper, we present Prophet, a conceptually simple framework designed to prompt GPT-3 with answer heuristics for knowledge-based VQA. Specifically, we first train a vanilla VQA model on a specific knowledge-based VQA dataset without external knowledge. We then extract two types of complementary answer heuristics from the model: answer candidates and answer-aware examples. Finally, both types of answer heuristics are encoded into the prompts so that GPT-3 can better comprehend the task, thus enhancing its capacity. Prophet significantly outperforms all existing state-of-the-art methods on two challenging knowledge-based VQA datasets, OK-VQA and A-OKVQA, achieving accuracies of 61.1% and 55.7% on their test sets, respectively.
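The three stages described above (train a VQA model, extract answer heuristics, encode them into a prompt) can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the `Sample` fields, the use of a caption as a textual stand-in for the image, the confidence scores attached to candidates, the choice of cosine similarity over model features to pick answer-aware examples, and the prompt wording. The trained VQA model is assumed to have already produced the features and candidates; no call to GPT-3 is made.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    """One VQA item with outputs of a (hypothetical) trained VQA model."""
    question: str
    caption: str                         # assumed textual stand-in for the image
    feature: List[float]                 # assumed latent feature from the VQA model
    candidates: List[Tuple[str, float]]  # (answer, confidence), best first
    answer: str = ""                     # ground truth; empty for the test input


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_examples(test: Sample, pool: List[Sample], n: int) -> List[Sample]:
    """Pick the n training samples whose features are closest to the test input."""
    ranked = sorted(pool, key=lambda s: cosine(test.feature, s.feature), reverse=True)
    return ranked[:n]


def format_block(s: Sample, k: int, with_answer: bool) -> str:
    """Render one sample, listing its top-k answer candidates."""
    cands = ", ".join(f"{a} ({c:.2f})" for a, c in s.candidates[:k])
    last = f"Answer: {s.answer}" if with_answer else "Answer:"
    return "\n".join([
        f"Context: {s.caption}",
        f"Question: {s.question}",
        f"Candidates: {cands}",
        last,
    ])


def build_prompt(test: Sample, pool: List[Sample], n_examples: int = 2,
                 k_candidates: int = 3) -> str:
    """Assemble in-context examples plus the test input into one prompt string."""
    header = ("Please answer the question according to the context "
              "and the answer candidates.")  # illustrative wording only
    blocks = [format_block(s, k_candidates, with_answer=True)
              for s in select_examples(test, pool, n_examples)]
    blocks.append(format_block(test, k_candidates, with_answer=False))
    return header + "\n\n" + "\n\n".join(blocks)
```

The resulting string would be sent to the language model as a completion prompt, and the text generated after the final `Answer:` would be taken as the prediction. Listing candidates in every in-context example, not only in the test input, is a design choice of this sketch: it shows the model how candidates relate to the correct answer.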