Large language models (LLMs) have achieved state-of-the-art results in many natural language processing tasks. They have also demonstrated the ability to adapt to different tasks in zero-shot and few-shot settings. Given these capabilities, researchers have looked into how to adopt LLMs for Visual Question Answering (VQA). Many existing methods require further training to align the image and text embeddings. However, these methods are computationally expensive and require large-scale image-text datasets for training. In this paper, we explore a method of combining pretrained LLMs and other foundation models, without further training, to solve the VQA problem. The general idea is to represent each image in natural language so that the LLM can reason about its content. We explore different decoding strategies for generating the textual representation of the image and evaluate their performance on the VQAv2 dataset.
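
The training-free idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `build_vqa_prompt` and `answer_question`, the prompt wording, and the `captioner` and `llm` callables are all hypothetical placeholders for whichever pretrained captioning model (with its chosen decoding strategy) and LLM are plugged in.

```python
from typing import Any, Callable, List


def build_vqa_prompt(captions: List[str], question: str) -> str:
    """Turn image captions plus a question into a text-only prompt.

    The prompt template is an assumption made for illustration only.
    """
    # Sampling-based decoding can return repeated captions, so drop
    # duplicates and empty strings while keeping the original order.
    unique = list(dict.fromkeys(c.strip() for c in captions if c.strip()))
    context = "\n".join(f"- {c}" for c in unique)
    return (
        "Image description:\n"
        f"{context}\n"
        f"Question: {question}\n"
        "Short answer:"
    )


def answer_question(
    image: Any,
    question: str,
    captioner: Callable[[Any], List[str]],  # hypothetical: image -> captions
    llm: Callable[[str], str],              # hypothetical: prompt -> completion
) -> str:
    """Answer a visual question without any additional training."""
    # Step 1: represent the image in natural language.
    captions = captioner(image)
    # Step 2: let the text-only LLM answer from that description.
    return llm(build_vqa_prompt(captions, question)).strip()
```

Under this framing, changing the decoding strategy only changes what `captioner` returns (for example, one greedy caption versus several sampled ones); the LLM side of the pipeline stays fixed.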