High-performance Multimodal Large Language Models (MLLMs) depend heavily on data quality. To advance fine-grained image recognition within MLLMs, we introduce a novel data synthesis method inspired by contrastive learning and image difference captioning. Our key idea is to challenge the model to discern both matching and distinct elements by scrutinizing object differences in detailed regions across similar images. We begin by generating pairs of similar images that emphasize object variations. We then employ a Difference Area Generator to pinpoint the differing objects, followed by a Difference Captions Generator to articulate these differences. This process yields a high-quality dataset of "object replacement" samples, termed Img-Diff, which can be scaled as needed thanks to its fully automated pipeline. We use this dataset to fine-tune state-of-the-art (SOTA) MLLMs such as InternVL2, achieving substantial improvements across various image-difference and Visual Question Answering tasks. Notably, the trained models significantly outperform existing SOTA models such as GPT-4V and Gemini on the MMVP benchmark. We further conduct comprehensive evaluations to validate the dataset's diversity, quality, and robustness, and offer several insights into the synthesis of such contrastive datasets. We release our code and dataset to encourage further research on multimodal data synthesis and on MLLMs' fundamental capabilities for image understanding.
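To make the three-stage pipeline concrete, below is a minimal Python sketch of the data flow: generate a similar image pair, locate the differing regions, then caption the differences. Every name here (generate_similar_pair, difference_area_generator, difference_captions_generator, ImgDiffSample) is a hypothetical placeholder for illustration, and the stub bodies only mimic the data flow; the actual components are defined in the released code.

```python
"""A minimal sketch of the Img-Diff "object replacement" synthesis pipeline.
All function and class names are illustrative assumptions, not the paper's API;
stub bodies stand in for the generative and detection models."""

from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[int, int, int, int]  # (x, y, width, height) of a differing region


@dataclass
class ImgDiffSample:
    image_a: str          # source image (e.g., a file path or handle)
    image_b: str          # near-identical image with one object replaced
    regions: List[BBox]   # bounding boxes where the two images differ
    captions: List[str]   # one difference caption per region


def generate_similar_pair(caption: str, replacement: str) -> Tuple[str, str]:
    """Step 1 (hypothetical): render two images from captions that differ only
    in one object, e.g., 'a cat on a sofa' vs. 'a dog on a sofa'."""
    return f"render({caption})", f"render({caption} -> {replacement})"


def difference_area_generator(image_a: str, image_b: str) -> List[BBox]:
    """Step 2 (hypothetical): pinpoint regions where the pair disagrees,
    e.g., by comparing patch features and detecting objects in mismatched areas."""
    return [(32, 48, 128, 128)]  # placeholder box


def difference_captions_generator(
    image_a: str, image_b: str, regions: List[BBox]
) -> List[str]:
    """Step 3 (hypothetical): describe what changed inside each region,
    yielding the final difference captions."""
    return ["The cat on the sofa has been replaced by a dog."]


def synthesize(caption: str, replacement: str) -> ImgDiffSample:
    """Chain the three stages into one 'object replacement' training sample."""
    image_a, image_b = generate_similar_pair(caption, replacement)
    regions = difference_area_generator(image_a, image_b)
    captions = difference_captions_generator(image_a, image_b, regions)
    return ImgDiffSample(image_a, image_b, regions, captions)


if __name__ == "__main__":
    print(synthesize("a cat on a sofa", "a dog on a sofa"))
```

Because each stage is automatic, the loop over caption/replacement pairs can be repeated at arbitrary scale, which is what allows the dataset to grow as needed.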