Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Early studies retrieve the required knowledge from explicit knowledge bases (KBs), which often introduces information irrelevant to the question and thereby limits model performance. Recent works instead use a powerful large language model (LLM) as an implicit knowledge engine to acquire the knowledge needed for answering. Despite the encouraging results achieved by these methods, we argue that they have not fully activated the capacity of the blind LLM, because the textual input provided to it is insufficient to depict the visual information required to answer the question. In this paper, we present Prophet, a conceptually simple, flexible, and general framework designed to prompt LLMs with answer heuristics for knowledge-based VQA. Specifically, we first train a vanilla VQA model on a specific knowledge-based VQA dataset without external knowledge. We then extract two types of complementary answer heuristics from the VQA model: answer candidates and answer-aware examples. Finally, the two types of answer heuristics are jointly encoded into a formatted prompt that helps the LLM understand both the image and the question, and thus generate a more accurate answer. By incorporating the state-of-the-art LLM GPT-3, Prophet significantly outperforms existing state-of-the-art methods on four challenging knowledge-based VQA datasets. To demonstrate the generality of our approach, we instantiate Prophet with combinations of different VQA models (i.e., both discriminative and generative ones) and different LLMs (i.e., both commercial and open-source ones).
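
The three stages described above (a trained VQA model, the two answer heuristics, and a formatted prompt) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the implementation used in this work: the `Example` record, the helper names, the use of a textual `context` (for instance a caption) to stand in for the image, the cosine-similarity selection of examples, and the prompt layout are all hypothetical. The sketch only shows how answer candidates with confidence scores and a few similar solved examples might be combined into one prompt for an LLM.

```python
import math
from dataclasses import dataclass


@dataclass
class Example:
    """A solved training sample (hypothetical structure, for illustration only)."""
    context: str      # assumed textual description of the image, e.g. a caption
    question: str
    candidates: list  # (answer, confidence) pairs from the VQA model
    answer: str       # ground-truth answer
    feature: list     # assumed feature vector produced by the VQA model


def top_k_candidates(scores, k):
    """Keep the k highest-scoring answers as answer candidates."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_examples(query_feature, pool, n):
    """Pick the n pool samples whose features are closest to the query.

    Assumption: similarity in the VQA model's feature space is used as the
    selection criterion for the in-context examples.
    """
    ranked = sorted(pool, key=lambda ex: cosine(query_feature, ex.feature),
                    reverse=True)
    return ranked[:n]


def format_entry(context, question, candidates, answer=None):
    """Render one prompt entry; the answer is left blank for the test input."""
    cand = ", ".join(f"{a} ({c:.2f})" for a, c in candidates)
    tail = "Answer:" if answer is None else f"Answer: {answer}"
    return "\n".join([f"Context: {context}", f"Question: {question}",
                      f"Candidates: {cand}", tail])


def build_prompt(header, examples, context, question, candidates):
    """Join the header, the solved examples, and the unanswered test entry."""
    entries = [format_entry(e.context, e.question, e.candidates, e.answer)
               for e in examples]
    entries.append(format_entry(context, question, candidates))
    return header + "\n\n" + "\n\n".join(entries)


pool = [
    Example("a man riding a wave", "What sport is this?",
            [("surfing", 0.9), ("swimming", 0.1)], "surfing", [1.0, 0.0]),
    Example("a plate of pasta", "What cuisine is this?",
            [("italian", 0.6), ("french", 0.4)], "italian", [0.0, 1.0]),
]
scores = {"surfing": 0.7, "skiing": 0.1, "swimming": 0.2}  # made-up VQA outputs
candidates = top_k_candidates(scores, 2)
chosen = select_examples([0.9, 0.1], pool, 1)
prompt = build_prompt(
    "Answer the question using the context and the candidates.",
    chosen, "a person on a board in the ocean", "What is he doing?", candidates)
print(prompt)
```

The resulting string would be sent to the LLM, whose completion after the final `Answer:` is taken as the prediction. Keeping the confidence scores next to each candidate is a design choice in this sketch: it lets the LLM either accept the VQA model's top answer or override it with its own knowledge.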