Interleaved image-text generation has emerged as a crucial multimodal task, aiming to create sequences of interleaved visual and textual content from a given query. Despite notable advances in recent multimodal large language models (MLLMs), generating integrated image-text sequences that exhibit narrative coherence and entity and style consistency remains challenging, largely due to poor training data quality. To address this gap, we introduce CoMM, a high-quality Coherent interleaved image-text MultiModal dataset designed to enhance the coherence, consistency, and alignment of generated multimodal content. CoMM first harvests raw data from diverse sources, focusing on instructional content and visual storytelling, to establish a foundation for coherent and consistent content. To further refine data quality, we devise a multi-perspective filtering strategy that leverages advanced pre-trained models to ensure coherent textual development, consistency across inserted images, and semantic alignment between text and images. We design a range of quality evaluation metrics to verify the high quality of the filtered dataset. Meanwhile, extensive few-shot experiments on various downstream tasks demonstrate that CoMM significantly enhances the in-context learning capabilities of MLLMs. Moreover, we propose four new tasks to evaluate MLLMs' interleaved generation abilities, supported by a comprehensive evaluation framework. We believe CoMM opens a new avenue for advanced MLLMs with superior multimodal in-context learning and understanding ability.