Knowledge-based visual question answering (VQA) involves questions that require world knowledge beyond the image to yield the correct answer. Large language models (LMs) like GPT-3 are particularly helpful for this task because of their strong knowledge retrieval and reasoning capabilities. To enable an LM to understand images, prior work uses a captioning model to convert images into text. However, when an image is summarized in a single caption sentence, the choice of which visual entities to describe is often underspecified. As a result, generic image captions often miss visual details essential for the LM to answer visual questions correctly. To address this challenge, we propose PromptCap (Prompt-guided image Captioning), a captioning model designed to serve as a better connector between images and black-box LMs. Unlike generic captioning models, PromptCap takes a natural-language prompt that controls which visual entities are described in the generated caption. The prompt contains a question that the caption should help answer. To avoid extra annotation, PromptCap is trained on examples synthesized with GPT-3 and existing datasets. We demonstrate PromptCap's effectiveness on an existing pipeline in which GPT-3 is prompted with image captions to carry out VQA. PromptCap outperforms generic captions by a large margin and achieves state-of-the-art accuracy on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA). Zero-shot results on WebQA show that PromptCap generalizes well to unseen domains.
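The caption-then-answer pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the `Example` record, and the `captioner` and `language_model` callables are all hypothetical stand-ins for a question-aware captioning model and a black-box LM such as GPT-3.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Example:
    """One in-context demonstration (hypothetical structure)."""
    caption: str
    question: str
    answer: str


def build_caption_prompt(question: str) -> str:
    # Assumed wording: the abstract only says the prompt contains the question
    # that the caption should help answer.
    return f"Please describe this image according to the given question: {question}"


def build_vqa_prompt(examples: List[Example], caption: str, question: str) -> str:
    # Few-shot text prompt: the caption stands in for the image as context.
    blocks = ["Please answer the question according to the context."]
    for ex in examples:
        blocks.append(
            f"Context: {ex.caption}\nQuestion: {ex.question}\nAnswer: {ex.answer}"
        )
    # The final block is left open so the LM completes the answer.
    blocks.append(f"Context: {caption}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(blocks)


def answer_question(
    image: Any,
    question: str,
    captioner: Callable[[Any, str], str],   # hypothetical question-aware captioner
    language_model: Callable[[str], str],   # hypothetical black-box LM
    examples: List[Example],
) -> str:
    # Step 1: generate a caption conditioned on the question.
    caption = captioner(image, build_caption_prompt(question))
    # Step 2: ask the LM to answer using only the caption as visual context.
    return language_model(build_vqa_prompt(examples, caption, question)).strip()


if __name__ == "__main__":
    # Stub models so the sketch runs without any external service.
    demo_captioner = lambda image, prompt: "a man riding a surfboard on a wave"
    demo_lm = lambda prompt: " surfing"
    shots = [Example("a cat sleeping on a sofa", "What animal is this?", "cat")]
    print(answer_question(None, "What sport is this?", demo_captioner, demo_lm, shots))
```

In this sketch, a generic captioner would ignore its second argument, whereas a prompt-guided one uses the question to decide which visual details to mention; everything downstream of the caption is unchanged.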