MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

Since the resurgence of deep learning, vision-language models (VLMs) that benefit from large language models (LLMs) have never been more popular. However, while LLMs can exploit extensive background knowledge and task information through in-context learning, most VLMs still struggle to understand complex multi-modal prompts that contain multiple images. The issue can be traced back to the architectural design of VLMs or to their pre-training data. Specifically, current VLMs primarily emphasize multi-modal data with a single image, rather than multi-modal prompts with multiple interleaved images and text. Even though some newly proposed VLMs can handle user prompts with multiple images, their pre-training data does not provide multi-modal prompts more sophisticated than interleaved images and text crawled from the web. We propose MMICL to address the issue from both the model and the data perspectives. We introduce a well-designed architecture capable of seamlessly integrating visual and textual context in an interleaved manner, together with the MIC dataset, to reduce the gap between the training data and the complex user prompts in real-world applications, including: 1) multi-modal context with interleaved images and text, 2) textual references for each image, and 3) multi-image data with spatial, logical, or temporal relationships. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot and few-shot performance on a wide range of general vision-language tasks, especially on complex reasoning benchmarks including MME and MMBench. Our analysis demonstrates that MMICL effectively deals with the challenge of complex multi-modal prompt understanding. The experiments on ScienceQA-IMG also show that MMICL successfully alleviates the issue of language bias in VLMs, which we believe is the reason behind the strong performance of MMICL.

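
To make the three prompt properties listed in the abstract more concrete, the following is a minimal sketch of how an interleaved multi-image prompt with a textual reference for each image might be assembled. It is not MMICL's actual template or code: the `ImageRef` class, the `render_prompt` helper, the `<imgN>` placeholder tokens, and the file names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ImageRef:
    """One image in the prompt (hypothetical structure, for illustration only)."""
    index: int
    path: str  # hypothetical file path; the file is never opened here

    @property
    def name(self) -> str:
        # Textual reference that later text can use to point at this image.
        return f"image {self.index}"

    @property
    def placeholder(self) -> str:
        # Assumed slot where a model would splice in the visual embeddings.
        return f"<img{self.index}>"


Segment = Union[str, ImageRef]


def render_prompt(segments: List[Segment]) -> str:
    """Flatten interleaved text and image segments into one prompt string."""
    parts = []
    for seg in segments:
        if isinstance(seg, ImageRef):
            # Declare the image once, binding its name to its placeholder.
            parts.append(f"{seg.name} is {seg.placeholder}.")
        else:
            parts.append(seg)
    return " ".join(parts)


# Two images with a temporal relationship, followed by a question that
# refers to each image by its textual name.
frames = [ImageRef(0, "frame_a.jpg"), ImageRef(1, "frame_b.jpg")]
prompt = render_prompt([
    frames[0],
    frames[1],
    f"What changed between {frames[0].name} and {frames[1].name}?",
])
print(prompt)
```

In this sketch each image is declared once and then referred to by name in the question, which is one simple way to express a relationship between several images in a single prompt. A real system would replace the placeholder tokens with visual features produced by an image encoder.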