Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering

Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Early studies retrieve the required knowledge from explicit knowledge bases (KBs), which often introduces information irrelevant to the question and thereby limits model performance. Recent works instead use a powerful large language model (LLM) as an implicit knowledge engine to acquire the knowledge needed for answering. Despite the encouraging results achieved by these methods, we argue that they have not fully activated the capacity of the blind LLM, because the textual input provided to it is insufficient to depict the visual information required to answer the question. In this paper, we present Prophet, a conceptually simple, flexible, and general framework designed to prompt LLMs with answer heuristics for knowledge-based VQA. Specifically, we first train a vanilla VQA model on a specific knowledge-based VQA dataset without external knowledge. After that, we extract two types of complementary answer heuristics from the VQA model: answer candidates and answer-aware examples. Finally, the two types of answer heuristics are jointly encoded into a formatted prompt to facilitate the LLM's understanding of both the image and the question, so that it generates a more accurate answer. By incorporating the state-of-the-art LLM GPT-3, Prophet significantly outperforms existing state-of-the-art methods on four challenging knowledge-based VQA datasets. To demonstrate the generality of our approach, we instantiate Prophet with combinations of different VQA models (i.e., both discriminative and generative ones) and different LLMs (i.e., both commercial and open-source ones).
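The three steps in the abstract (take answer candidates from a trained VQA model, pick answer-aware examples, encode both into one prompt) can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the prompt template, so the `Context`/`Question`/`Candidates`/`Answer` layout, the use of a caption-like `context` string, and the choice of cosine similarity over VQA-model features to select examples are all hypothetical, as are every function and field name.

```python
import math
from dataclasses import dataclass


@dataclass
class Example:
    """One VQA sample. All field names here are hypothetical."""
    question: str
    context: str                 # assumed textual description of the image
    candidates: list             # [(answer, confidence), ...] from the VQA model
    answer: str = ""             # ground truth; empty for the test sample
    feature: tuple = ()          # assumed feature vector from the VQA model


def top_k_candidates(scores, k):
    """Answer candidates: the k highest-scoring answers of the VQA model."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_examples(query_feature, pool, n):
    """Answer-aware examples: the n training samples closest to the query.

    Selecting by feature similarity is an assumption of this sketch.
    """
    ranked = sorted(pool, key=lambda ex: cosine(query_feature, ex.feature),
                    reverse=True)
    return ranked[:n]


def format_block(ex, with_answer):
    """Render one sample; the test sample leaves the answer slot open."""
    cands = ", ".join(f"{a} ({c:.2f})" for a, c in ex.candidates)
    answer_line = f"Answer: {ex.answer if with_answer else ''}".rstrip()
    return "\n".join([
        f"Context: {ex.context}",
        f"Question: {ex.question}",
        f"Candidates: {cands}",
        answer_line,
    ])


def build_prompt(test, examples):
    """Jointly encode both heuristics into a single text prompt for an LLM."""
    header = "Answer the question using the context and the candidate answers."
    blocks = [format_block(ex, True) for ex in examples]
    blocks.append(format_block(test, False))
    return header + "\n\n" + "\n\n".join(blocks)
```

The resulting string would then be sent to whichever LLM is in use; that call is left out here because it depends on the provider. Keeping the confidence scores next to each candidate is one possible design choice: it lets the LLM see how certain the VQA model was, rather than only what it predicted.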


