Knowledge-based visual question answering (VQA) involves questions that require world knowledge beyond the image to yield the correct answer. Large language models (LMs) like GPT-3 are particularly helpful for this task because of their strong knowledge retrieval and reasoning capabilities. To enable an LM to understand images, prior work uses a captioning model to convert images into text. However, when an image is summarized in a single caption sentence, which visual entities to describe is often underspecified, so generic image captions often miss visual details essential for the LM to answer visual questions correctly. To address this challenge, we propose PromptCap (Prompt-guided image Captioning), a captioning model designed to serve as a better connector between images and black-box LMs. Unlike generic captioning models, PromptCap takes a natural-language prompt that controls which visual entities the generated caption describes. The prompt contains a question that the caption should help answer. To avoid extra annotation, PromptCap is trained on examples synthesized with GPT-3 and existing datasets. We demonstrate PromptCap's effectiveness on an existing pipeline in which GPT-3 is prompted with image captions to carry out VQA. PromptCap outperforms generic captions by a large margin and achieves state-of-the-art accuracy on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA). Zero-shot results on WebQA show that PromptCap generalizes well to unseen domains.
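The caption-then-answer pipeline described above can be illustrated with a minimal Python sketch. It is not the PromptCap implementation: the prompt wording, the `Context/Question/Answer` layout, and the names `build_caption_prompt`, `build_lm_prompt`, and `answer_question` are all assumptions made for illustration, and the captioner and the LM are passed in as plain callables rather than tied to any particular library.

```python
from typing import Any, Callable, Sequence, Tuple

# (caption, question, answer) triple used as an in-context example.
Shot = Tuple[str, str, str]


def build_caption_prompt(question: str) -> str:
    """Build the captioner prompt.

    Hypothetical wording: the text only states that the prompt
    contains the question the caption should help answer.
    """
    return f"Describe the image so that this question can be answered: {question}"


def build_lm_prompt(caption: str, question: str, shots: Sequence[Shot] = ()) -> str:
    """Format in-context examples plus the test instance as plain text.

    The Context/Question/Answer layout is an assumed format.
    """
    blocks = [f"Context: {c}\nQuestion: {q}\nAnswer: {a}" for c, q, a in shots]
    blocks.append(f"Context: {caption}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(blocks)


def answer_question(
    image: Any,
    question: str,
    captioner: Callable[[Any, str], str],
    language_model: Callable[[str], str],
    shots: Sequence[Shot] = (),
) -> str:
    """Caption the image with a question-aware prompt, then query the LM."""
    caption = captioner(image, build_caption_prompt(question))
    return language_model(build_lm_prompt(caption, question, shots)).strip()
```

In this sketch the LM only ever sees text, so it can remain a black box; the question influences the result twice, once through the captioner prompt and once through the LM prompt.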