Large language models (LLMs) famously exhibit emergent in-context learning (ICL) -- the ability to rapidly adapt to new tasks using few-shot examples provided as a prompt, without updating the model's weights. Built on top of LLMs, vision large language models (VLLMs) have advanced significantly in areas such as recognition, reasoning, and grounding. However, investigations into \emph{multimodal ICL} have predominantly focused on few-shot visual question answering (VQA) and image captioning, which we will show neither exploit the strengths of ICL nor test its limitations. The broader capabilities and limitations of multimodal ICL remain under-explored. In this study, we introduce VL-ICL Bench, a comprehensive benchmark for multimodal in-context learning, encompassing a broad spectrum of tasks that involve both images and text as inputs and outputs, and different types of challenges, from perception to reasoning and long context length. We evaluate the abilities of state-of-the-art VLLMs against this benchmark suite, revealing their diverse strengths and weaknesses, and showing that even the most advanced models, such as GPT-4, find the tasks challenging. By highlighting a range of new ICL tasks, and the associated strengths and limitations of existing models, we hope that our dataset will inspire future work on enhancing the in-context learning capabilities of VLLMs, as well as new applications that leverage VLLM ICL. The code and dataset are available at https://github.com/ys-zong/VL-ICL.
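To make the prompting setup concrete, the sketch below illustrates how a multimodal ICL prompt is typically assembled: k support examples of interleaved image-text pairs with their answers, followed by a query whose answer the model must predict, with no weight updates. This is a minimal, hypothetical illustration; the `Example` class and message format are assumptions for exposition and do not reflect the actual VL-ICL Bench implementation.

```python
# Minimal sketch of a multimodal in-context learning (ICL) prompt.
# The message schema below is illustrative, not tied to any specific VLLM API.

from dataclasses import dataclass

@dataclass
class Example:
    image_path: str            # path to the support/query image
    question: str              # textual part of the input
    answer: str | None = None  # ground-truth answer for support examples only

def build_icl_prompt(support: list[Example], query: Example) -> list[dict]:
    """Interleave k-shot image-text support examples with a query example."""
    messages = []
    for ex in support:
        # Each support shot shows the model an (image, question) -> answer mapping.
        messages.append({"role": "user", "content": [
            {"type": "image", "path": ex.image_path},
            {"type": "text", "text": ex.question},
        ]})
        messages.append({"role": "assistant", "content": ex.answer})
    # The query follows the same format but leaves the answer to be predicted.
    messages.append({"role": "user", "content": [
        {"type": "image", "path": query.image_path},
        {"type": "text", "text": query.question},
    ]})
    return messages
```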