Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown rapidly in popularity. However, while LLMs can exploit extensive background knowledge and task information through in-context learning (ICL), most VLMs still struggle to understand complex multi-modal prompts that contain multiple images, which makes them less effective in downstream vision-language tasks. In this paper, we address this limitation by 1) introducing MMICL, a new approach that allows the VLM to handle multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; and 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks, especially on complex benchmarks such as MME and MMBench. Our analysis demonstrates that MMICL effectively tackles the challenge of complex multi-modal prompt understanding and exhibits impressive ICL ability. Furthermore, we observe that MMICL successfully alleviates language bias in VLMs, a common issue that often leads to hallucination when the model is faced with extensive textual context.