Developing generative models for interleaved image-text data has both research and practical value. Such models must understand interleaved sequences and subsequently generate images and text. However, existing attempts are limited in that a fixed number of visual tokens cannot efficiently capture image details, which is particularly problematic in multi-image scenarios. To address this, this paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features from the previous context during generation. MM-Interleaved is pre-trained end-to-end on both paired and interleaved image-text corpora, and is further enhanced through a supervised fine-tuning phase that improves its ability to follow complex multi-modal instructions. Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details according to multi-modal instructions and in generating images consistent with both textual and visual conditions. Code and models are available at \url{https://github.com/OpenGVLab/MM-Interleaved}.
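The core idea of the feature synchronizer can be illustrated with a minimal sketch: decoder query tokens cross-attend to a bank built from the multi-scale features of all previous images, so the generator can retrieve fine-grained details on demand instead of relying on a fixed, compressed set of visual tokens. The function and weight initialization below are illustrative assumptions, not the paper's actual implementation (which uses learned, multi-layer attention).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def feature_synchronizer(queries, image_feats, rng):
    """Hypothetical sketch: query tokens cross-attend to the concatenated
    multi-scale features of all previous images in the context."""
    d = queries.shape[-1]
    # Flatten every scale of every previous image into one key/value bank.
    bank = np.concatenate([f.reshape(-1, d) for f in image_feats], axis=0)
    # Random projections stand in for learned attention weights (assumption).
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = queries @ Wq, bank @ Wk, bank @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))          # (num_queries, bank_size)
    return queries + attn @ v                     # residual fine-grained update

rng = np.random.default_rng(0)
d = 32
# Two previous images, each with features at two scales (e.g. 8x8 and 4x4 grids).
image_feats = [rng.standard_normal((64, d)), rng.standard_normal((16, d)),
               rng.standard_normal((64, d)), rng.standard_normal((16, d))]
queries = rng.standard_normal((10, d))
out = feature_synchronizer(queries, image_feats, rng)
print(out.shape)  # (10, 32): one updated feature per query token
```

Because the key/value bank grows with the number of context images and scales while the query count stays fixed, this design keeps the per-step token budget constant yet lets each generation step look up arbitrarily fine detail from earlier images.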