High-performance Multimodal Large Language Models (MLLMs) rely heavily on data quality. This study introduces a novel dataset named Img-Diff, designed to enhance fine-grained image recognition in MLLMs by leveraging insights from contrastive learning and image difference captioning. By analyzing object differences between similar images, we challenge models to identify both matching and distinct components. We utilize the Stable-Diffusion-XL model and advanced image editing techniques to create pairs of similar images that highlight object replacements. Our methodology includes a Difference Area Generator for identifying object differences, followed by a Difference Captions Generator for producing detailed difference descriptions. The result is a relatively small but high-quality dataset of "object replacement" samples. We use the proposed dataset to finetune state-of-the-art (SOTA) MLLMs such as MGM-7B, yielding consistent performance improvements over SOTA models trained with larger-scale datasets across numerous image-difference and Visual Question Answering tasks. For instance, our trained models notably surpass the SOTA models GPT-4V and Gemini on the MMVP benchmark. In addition, we investigate alternative methods for generating image difference data through "object removal" and conduct a thorough evaluation to confirm the dataset's diversity, quality, and robustness, presenting several insights on the synthesis of such a contrastive dataset. To encourage further research and advance the field of multimodal data synthesis and the enhancement of MLLMs' fundamental capabilities for image understanding, we release our code and dataset at https://github.com/modelscope/data-juicer/tree/ImgDiff.
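The two-stage pipeline sketched above (a Difference Area Generator that localizes what changed between an image pair, followed by a Difference Captions Generator that verbalizes the change) can be illustrated with a minimal, hypothetical toy. All names here (`difference_area`, `difference_caption`) and the label-grid representation are illustrative assumptions for exposition only; the actual Img-Diff pipeline operates on Stable-Diffusion-XL image pairs, not label grids.

```python
# Hypothetical sketch of the two-stage Img-Diff idea on toy "images"
# represented as 2D grids of object labels (NOT the real pipeline).

def difference_area(grid_a, grid_b):
    """Stage 1 (toy): bounding box (top, left, bottom, right) of differing cells."""
    rows = [r for r, (ra, rb) in enumerate(zip(grid_a, grid_b)) if ra != rb]
    cols = [c for ra, rb in zip(grid_a, grid_b)
            for c, (a, b) in enumerate(zip(ra, rb)) if a != b]
    if not rows:
        return None  # the two "images" are identical
    return (min(rows), min(cols), max(rows), max(cols))

def difference_caption(grid_a, grid_b, box):
    """Stage 2 (toy): describe the object replacement inside the region."""
    top, left, bottom, right = box
    before = {grid_a[r][c] for r in range(top, bottom + 1)
              for c in range(left, right + 1)}
    after = {grid_b[r][c] for r in range(top, bottom + 1)
             for c in range(left, right + 1)}
    removed = sorted(before - after)
    added = sorted(after - before)
    return f"{', '.join(removed)} replaced with {', '.join(added)}"

# Toy "object replacement" pair: a cat patch swapped for a dog patch.
img_a = [["sky", "sky"], ["cat", "grass"]]
img_b = [["sky", "sky"], ["dog", "grass"]]
box = difference_area(img_a, img_b)          # -> (1, 0, 1, 0)
caption = difference_caption(img_a, img_b, box)  # -> "cat replaced with dog"
```

This mirrors the dataset's structure: each sample pairs a localized difference region with a natural-language description of the replacement.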