Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions

Multimodal Large Language Models (MLLMs) have recently attracted significant interest, demonstrating emergent capabilities as general-purpose models for various vision-language tasks. However, existing methods mainly focus on limited types of instructions with a single image as visual context, which limits the broader applicability of MLLMs. In this paper, we introduce the I4 benchmark to comprehensively evaluate instruction-following ability on complicated interleaved vision-language instructions, which involve intricate image-text sequential context and cover a diverse range of scenarios (e.g., visually-rich webpages/textbooks, lecture slides, embodied dialogue). Systematic evaluation on our I4 benchmark reveals a common defect of existing methods: the Visual Prompt Generator (VPG), trained on an image-captioning alignment objective, tends to attend to common foreground information for captioning but struggles to extract the specific information required by particular tasks. To address this issue, we propose a generic and lightweight controllable knowledge re-injection module, which utilizes the sophisticated reasoning ability of LLMs to control the VPG to conditionally extract instruction-specific visual information and re-inject it into the LLM. Further, we introduce an annotation-free, cross-attention-guided counterfactual image training strategy that systematically trains the proposed module through the collaboration of a cascade of foundation models. Enhanced by the proposed module and training strategy, we present Cheetor, a Transformer-based MLLM that can effectively handle a wide variety of interleaved vision-language instructions and achieves state-of-the-art zero-shot performance across all tasks of I4, without high-quality multimodal instruction tuning data. Cheetor also exhibits competitive performance compared with state-of-the-art instruction-tuned models on the MME benchmark.
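The abstract does not specify how the re-injection module is implemented, so the following is only a toy sketch of the general idea: an instruction-derived condition steers which visual features are pooled, and the result is appended to the LLM input. It assumes a plain scaled dot-product cross-attention; the function names (`cross_attend`, `condition_queries`, `reinject`), the additive conditioning rule, and the toy data are all hypothetical and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, patches):
    """Scaled dot-product cross-attention: each query pools the patch features."""
    dim = len(patches[0])
    scale = math.sqrt(dim)
    tokens = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, p)) / scale for p in patches]
        weights = softmax(scores)
        tokens.append([sum(w * p[d] for w, p in zip(weights, patches))
                       for d in range(dim)])
    return tokens

def condition_queries(base_queries, condition):
    """Hypothetical conditioning rule: shift every query by the instruction vector."""
    return [[q + c for q, c in zip(query, condition)] for query in base_queries]

def reinject(llm_inputs, visual_tokens):
    """Hypothetical re-injection: append the extracted tokens to the LLM input sequence."""
    return llm_inputs + visual_tokens

# Toy data (assumed): two image patches with orthogonal features, one neutral query.
patches = [[1.0, 0.0], [0.0, 1.0]]
base_queries = [[0.0, 0.0]]

# Unconditioned pass: stands in for the generic, caption-style visual prompt.
generic = cross_attend(base_queries, patches)

# Conditioned pass: the condition vector stands in for an LLM-derived instruction signal.
specific = cross_attend(condition_queries(base_queries, [4.0, 0.0]), patches)

# The instruction-specific tokens are added alongside the generic ones.
llm_inputs = reinject(generic, specific)
```

In this toy setup the neutral query weights both patches equally, while the conditioned query shifts most of its attention weight to the first patch, which is the kind of instruction-dependent extraction the abstract describes at a high level.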
