High-performance Multimodal Large Language Models (MLLMs) rely heavily on data quality. This study introduces a novel dataset named Img-Diff, designed to enhance fine-grained image recognition in MLLMs by leveraging insights from contrastive learning and image difference captioning. By analyzing object differences between similar images, we challenge models to identify both matching and distinct components. We utilize the Stable-Diffusion-XL model and advanced image editing techniques to create pairs of similar images that highlight object replacements. Our methodology includes a Difference Area Generator for identifying object differences, followed by a Difference Captions Generator for producing detailed difference descriptions. The result is a relatively small but high-quality dataset of "object replacement" samples. We use the proposed dataset to finetune state-of-the-art (SOTA) MLLMs such as MGM-7B, yielding consistent performance improvements over SOTA models trained with larger-scale datasets across numerous image-difference and Visual Question Answering tasks. For instance, our trained models notably surpass the SOTA models GPT-4V and Gemini on the MMVP benchmark. In addition, we investigate alternative methods for generating image difference data through "object removal" and conduct a thorough evaluation to confirm the dataset's diversity, quality, and robustness, presenting several insights on the synthesis of such a contrastive dataset. To encourage further research and advance the field of multimodal data synthesis and the enhancement of MLLMs' fundamental capabilities for image understanding, we release our code and dataset at https://github.com/modelscope/data-juicer/tree/ImgDiff.
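The two-stage pipeline sketched above (a Difference Area Generator that localizes what changed between an image pair, followed by a Difference Captions Generator that verbalizes the change) can be illustrated with a minimal, hypothetical toy. All names here (`difference_area`, `difference_caption`) and the label-grid representation are illustrative assumptions for exposition only; the actual Img-Diff pipeline operates on Stable-Diffusion-XL image pairs, not label grids.

```python
# Hypothetical sketch of the two-stage Img-Diff idea on toy "images"
# represented as 2D grids of object labels (NOT the real pipeline).

def difference_area(grid_a, grid_b):
    """Stage 1 (toy): bounding box (top, left, bottom, right) of differing cells."""
    rows = [r for r, (ra, rb) in enumerate(zip(grid_a, grid_b)) if ra != rb]
    cols = [c for ra, rb in zip(grid_a, grid_b)
            for c, (a, b) in enumerate(zip(ra, rb)) if a != b]
    if not rows:
        return None  # the two "images" are identical
    return (min(rows), min(cols), max(rows), max(cols))

def difference_caption(grid_a, grid_b, box):
    """Stage 2 (toy): describe the object replacement inside the region."""
    top, left, bottom, right = box
    before = {grid_a[r][c] for r in range(top, bottom + 1)
              for c in range(left, right + 1)}
    after = {grid_b[r][c] for r in range(top, bottom + 1)
             for c in range(left, right + 1)}
    removed = sorted(before - after)
    added = sorted(after - before)
    return f"{', '.join(removed)} replaced with {', '.join(added)}"

# Toy "object replacement" pair: a cat patch swapped for a dog patch.
img_a = [["sky", "sky"], ["cat", "grass"]]
img_b = [["sky", "sky"], ["dog", "grass"]]
box = difference_area(img_a, img_b)          # -> (1, 0, 1, 0)
caption = difference_caption(img_a, img_b, box)  # -> "cat replaced with dog"
```

This mirrors the dataset's structure: each sample pairs a localized difference region with a natural-language description of the replacement.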