Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions

Multimodal Large Language Models (MLLMs) have recently sparked significant interest, demonstrating emergent capabilities as general-purpose models for various vision-language tasks. However, existing methods mainly focus on limited types of instructions with a single image as visual context, which hinders the widespread applicability of MLLMs. In this paper, we introduce the I4 benchmark to comprehensively evaluate instruction-following ability on complicated interleaved vision-language instructions, which involve intricate sequential image-text context and cover a diverse range of scenarios (e.g., visually rich webpages/textbooks, lecture slides, embodied dialogue). Systematic evaluation on the I4 benchmark reveals a common defect of existing methods: the Visual Prompt Generator (VPG), trained on an image-captioning alignment objective, tends to attend to common foreground information useful for captioning but struggles to extract the specific information required by particular tasks. To address this issue, we propose a generic and lightweight controllable knowledge re-injection module, which uses the sophisticated reasoning ability of LLMs to control the VPG so that it conditionally extracts instruction-specific visual information and re-injects it into the LLM. Further, we introduce an annotation-free, cross-attention-guided counterfactual image training strategy to methodically learn the proposed module through the collaboration of a cascade of foundation models. Enhanced by the proposed module and training strategy, we present Cheetah, an MLLM that effectively handles a wide variety of interleaved vision-language instructions and achieves state-of-the-art zero-shot performance across all tasks of I4 without high-quality multimodal instruction tuning data. Moreover, Cheetah exhibits competitive performance compared with state-of-the-art instruction-tuned models on the concurrent MME benchmark.
