Current multimodal models that rely on contrastive learning often fail to develop a fine-grained conceptual understanding. The cause is that negative samples are drawn at random during pretraining, so the loss function compares almost exclusively very dissimilar concepts. Consequently, the models struggle with fine-grained semantic differences. To address this problem, we introduce a novel pretraining method that incorporates synthetic hard negative text examples. The hard negatives permute terms corresponding to visual concepts, leading to a more fine-grained alignment of visual and textual concepts. Further, we introduce InpaintCOCO, a new challenging dataset for assessing the fine-grained alignment of colors, objects, and sizes in vision-language models. We created the dataset from COCO images using generative inpainting, changing the visual concepts so that the images no longer match their original captions. Our results show significant improvements in fine-grained concept understanding across a wide range of vision-language datasets, including our InpaintCOCO dataset.
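To illustrate the idea of synthetic hard negative captions, the following is a minimal sketch, not the method as implemented in this work. It assumes a small hypothetical concept vocabulary (`CONCEPTS`), replaces one concept term in a caption with a different term of the same category, and adds the resulting caption as an extra negative in an InfoNCE-style loss for a single image. The function names, the vocabulary, the substitution strategy, and the temperature value are all assumptions made for illustration.

```python
import math
import random

# Hypothetical concept vocabulary; the actual term lists are not given in the text.
CONCEPTS = {
    "color": ["red", "blue", "green", "yellow"],
    "size": ["small", "large"],
    "object": ["dog", "cat", "car", "bench"],
}


def make_hard_negative(caption, rng):
    """Swap one concept term for a different term of the same category.

    Returns None when the caption contains no known concept term.
    """
    tokens = caption.split()
    # Collect (position, category) for every token that is a known concept term.
    hits = [
        (i, category)
        for i, token in enumerate(tokens)
        for category, terms in CONCEPTS.items()
        if token.lower() in terms
    ]
    if not hits:
        return None
    i, category = rng.choice(hits)
    alternatives = [t for t in CONCEPTS[category] if t != tokens[i].lower()]
    tokens[i] = rng.choice(alternatives)
    return " ".join(tokens)


def contrastive_loss(sim_true, sim_negatives, temperature=0.07):
    """InfoNCE-style loss for one image: negative log-softmax of the true caption.

    sim_true is the image-text similarity of the matching caption;
    sim_negatives holds the similarities of all negative captions.
    """
    logits = [sim_true / temperature] + [s / temperature for s in sim_negatives]
    m = max(logits)  # subtract the maximum for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


if __name__ == "__main__":
    rng = random.Random(0)
    caption = "a small red car next to a bench"
    print(make_hard_negative(caption, rng))

    # Random in-batch negatives are typically far from the image ...
    base = contrastive_loss(0.8, [0.1, 0.0])
    # ... while a hard negative caption scores close to the true caption.
    with_hard = contrastive_loss(0.8, [0.1, 0.0, 0.75])
    print(base, with_hard)
```

In this sketch, a hard negative whose similarity is close to that of the true caption contributes a large term to the denominator, so the loss stays high until the model separates the two captions. Random negatives with low similarity contribute almost nothing, which is the limitation described above.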