Large Language Models have demonstrated remarkable performance across a wide range of tasks and can rapidly acquire new skills, for example through In-Context Learning (ICL) from a handful of demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the strongest open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study reveals several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality; (2) when used with an advanced ICL strategy (such as RICES), M-ICL performs no better than a simple majority vote over the labels of the context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code is available at https://gitlab.com/folbaeni/multimodal-icl
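The majority-voting baseline mentioned in finding (2) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it predicts the most frequent label among the in-context demonstrations while ignoring the query input entirely (the function and variable names are hypothetical).

```python
from collections import Counter

def majority_vote_baseline(context_examples):
    """Predict the most frequent label among the in-context examples.

    Hypothetical sketch: each example is an (input, label) pair; the
    query itself is never consulted, which is what makes this a
    degenerate baseline for ICL comparisons.
    """
    labels = [label for _, label in context_examples]
    # most_common(1) returns [(top_label, count)]; take the label
    return Counter(labels).most_common(1)[0][0]

# Example: three demonstrations, two labeled "cat" and one "dog"
demos = [("img1", "cat"), ("img2", "dog"), ("img3", "cat")]
print(majority_vote_baseline(demos))  # -> cat
```

If such a context-only baseline matches a retrieval-based selection strategy like RICES, it suggests the model is exploiting label statistics of the demonstrations rather than the multimodal content itself.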