Developing generative models for interleaved image-text data has both research and practical value. Such models must understand interleaved sequences and then generate both images and text. However, existing approaches are limited because a fixed number of visual tokens cannot efficiently capture image details, which is particularly problematic in multi-image scenarios. To address this, this paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale and multi-image feature synchronizer module that gives the model direct access to fine-grained image features from the preceding context during generation. MM-Interleaved is pre-trained end-to-end on both paired and interleaved image-text corpora, and is then further improved through supervised fine-tuning, which strengthens its ability to follow complex multi-modal instructions. Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details when following multi-modal instructions and in generating consistent images under both textual and visual conditions. Code and models are available at \url{https://github.com/OpenGVLab/MM-Interleaved}.
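To make the idea of "direct access to fine-grained image features in the previous context" more concrete, the following is a minimal sketch rather than the paper's implementation. The abstract does not specify how the feature synchronizer attends to image features, so plain dense cross-attention with a causal mask over images is swapped in here as a stand-in. All names (`flatten_scales`, `attend_to_previous_images`, `token_pos`, `image_pos`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; -inf entries receive zero weight."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def flatten_scales(feature_maps):
    """Flatten one image's multi-scale (H, W, C) maps into a (sum H*W, C) array."""
    return np.concatenate(
        [f.reshape(-1, f.shape[-1]) for f in feature_maps], axis=0
    )


def attend_to_previous_images(hidden, token_pos, images, image_pos):
    """Let each token attend to multi-scale features of earlier images only.

    hidden:    (T, C) float array of token hidden states.
    token_pos: (T,) int array, position of each token in the interleaved sequence.
    images:    list with one entry per image; each entry is a list of (H, W, C) maps.
    image_pos: list of ints, position of each image in the interleaved sequence.

    Assumption: dense cross-attention stands in for whatever mechanism the
    actual synchronizer module uses.
    """
    keys, owner = [], []
    for idx, maps in enumerate(images):
        flat = flatten_scales(maps)
        keys.append(flat)
        # Remember which sequence position each feature vector belongs to.
        owner.append(np.full(len(flat), image_pos[idx]))
    keys = np.concatenate(keys, axis=0)    # (K, C)
    owner = np.concatenate(owner)          # (K,)

    scores = hidden @ keys.T / np.sqrt(hidden.shape[-1])   # (T, K)
    # A token may only see images that appear strictly before it.
    visible = owner[None, :] < token_pos[:, None]          # (T, K)
    scores = np.where(visible, scores, -np.inf)

    out = hidden.copy()
    has_ctx = visible.any(axis=1)
    if has_ctx.any():
        weights = softmax(scores[has_ctx], axis=-1)
        # Residual update: tokens with no preceding image stay unchanged.
        out[has_ctx] = hidden[has_ctx] + weights @ keys
    return out
```

In this sketch, features from every scale of every preceding image remain available at each generation step, instead of being compressed into a fixed number of visual tokens up front. Dense attention over all feature locations grows costly as images accumulate, so a practical module would likely use a sparser attention scheme; that choice is outside what the abstract states.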