Motivated by the in-context learning (ICL) capabilities of Large Language Models (LLMs), multimodal LLMs with an additional visual modality also exhibit similar ICL abilities when multiple image-text pairs are provided as demonstrations. However, relatively little work has investigated the principles behind how and why multimodal ICL works. We conduct a systematic and principled evaluation of multimodal ICL for models of different scales on a broad spectrum of new yet critical tasks. Through perturbations of different modality information, we show that modalities matter differently across tasks in multimodal ICL. Guided by task-specific modality impact, we recommend modality-driven demonstration strategies to boost ICL performance. We also find that models may follow inductive biases from multimodal ICL even when these biases are rarely seen in, or contradict, semantic priors from pretraining data. Our principled analysis provides a comprehensive way of understanding the role of demonstrations in multimodal in-context learning, and sheds light on effectively improving multimodal ICL on a wide range of tasks.