Two approaches have emerged for feeding images into large language models (LLMs). The first captions the image in natural language. The second maps image feature embeddings into the embedding space of the LLM and passes the mapped embeddings directly to the LLM. Most recent few-shot multimodal work reports performance using architectures that employ a variation of one of these two approaches, but it overlooks an important comparison between them. We design a controlled, focused experiment to compare the two approaches on few-shot visual question answering (VQA) with LLMs. Our findings indicate that for Flan-T5 XL, a 3B-parameter LLM, connecting visual embeddings directly to the LLM embedding space does not guarantee improved performance over using image captions. In the zero-shot regime, textual image captions perform better. In the few-shot regime, which approach performs better depends on how the in-context examples are selected.