MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity. However, while LLMs can utilize extensive background knowledge and task information through in-context learning, most VLMs still struggle to understand complex multi-modal prompts containing multiple images, which makes them less effective in downstream vision-language tasks. In this paper, we address this limitation by 1) introducing Vision-Language Model with Multi-Modal In-Context Learning (MMICL), a new approach that allows the VLM to handle multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks, especially on complex benchmarks, including MME and MMBench. Our analysis demonstrates that MMICL effectively tackles the challenge of complex multi-modal prompt understanding and exhibits impressive ICL ability. Furthermore, we observe that MMICL successfully alleviates language bias in VLMs, a common issue that often leads to hallucination when the model is faced with extensive textual context. Our code, dataset, dataset tool, and model are available at https://github.com/PKUnlp-icler/MIC

