Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, effectively using LLMs for zero-shot visual question answering (VQA) remains challenging, primarily because of the modality disconnection and the task disconnection between the LLM and the VQA task. End-to-end training on vision and language data may bridge these disconnections, but it is inflexible and computationally expensive. To address this issue, we propose \emph{Img2Prompt}, a plug-and-play module that provides prompts that bridge the aforementioned modality and task disconnections, so that LLMs can perform zero-shot VQA without end-to-end training. To generate such prompts, we employ LLM-agnostic models that describe the image content and produce self-constructed question-answer pairs, which effectively guide the LLM to perform zero-shot VQA. Img2Prompt offers the following benefits: 1)~It works flexibly with various LLMs to perform VQA. 2)~Because it needs no end-to-end training, it significantly reduces the cost of deploying LLMs for zero-shot VQA. 3)~It achieves comparable or better performance than methods that rely on end-to-end training. For example, we outperform Flamingo \cite{Deepmind:Flamingo2022} by 5.6\% on VQAv2. On the challenging A-OKVQA dataset, our method even outperforms few-shot methods by as much as 20\%.
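As a rough illustration of the idea described in the abstract, the sketch below assembles a text-only prompt from image captions and synthetic question-answer pairs. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name \texttt{build\_prompt}, the template wording (``Contexts:'', ``Question:'', ``Answer:''), and the example inputs are hypothetical, and the captions and question-answer pairs are assumed to have been produced beforehand by separate LLM-agnostic models.

```python
from typing import List, Tuple


def build_prompt(
    captions: List[str],
    qa_pairs: List[Tuple[str, str]],
    question: str,
) -> str:
    """Assemble a text-only prompt for a frozen LLM.

    Hypothetical template: the exact wording is an assumption made for
    illustration, not the format used in the paper.
    """
    # Captions describe the image in text, addressing the modality gap.
    context = "Contexts: " + " ".join(captions)
    # Synthetic question-answer pairs serve as in-context exemplars,
    # addressing the task gap.
    exemplars = "\n".join(f"Question: {q} Answer: {a}" for q, a in qa_pairs)
    # The actual question comes last, with the answer left open for the LLM.
    query = f"Question: {question} Answer:"
    # Skip the exemplar section when no pairs are supplied.
    return "\n".join(part for part in (context, exemplars, query) if part)


prompt = build_prompt(
    captions=["A dog runs on a beach."],
    qa_pairs=[("What animal is this?", "dog")],
    question="Where is the dog?",
)
print(prompt)
```

The resulting string can be passed to any text-only LLM, which is what makes a module of this kind plug-and-play: the LLM itself is neither modified nor fine-tuned.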