This paper introduces the ColorSwap dataset, designed to assess and improve the proficiency of multimodal models in matching objects with their colors. The dataset comprises 2,000 unique image-caption pairs, grouped into 1,000 examples. Each example includes a caption-image pair along with a "color-swapped" pair. We follow the Winoground schema: the two captions in an example contain the same words, but the color words are rearranged to modify different objects. The dataset was created through a novel blend of automated caption and image generation with humans in the loop. We evaluate image-text matching (ITM) models and visual language models (VLMs) and find that even the latest ones are still not robust at this task. GPT-4V and LLaVA score 72% and 42% on our main VLM metric, although they may improve with more advanced prompting techniques. On the main ITM metric, contrastive models such as CLIP and SigLIP perform close to chance (12% and 30%, respectively), whereas the non-contrastive BLIP ITM model is stronger (87%). We also find that finetuning on fewer than 2,000 examples yields significant performance gains on this out-of-distribution word-order understanding task. The dataset is available at https://github.com/Top34051/colorswap.
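To make the Winoground-style evaluation concrete, the following is a minimal sketch of how an example with two captions and two images might be scored from a model's matching scores. It assumes the text, image, and group scores as defined for Winoground. The abstract does not say which of these is the paper's "main ITM metric", so that mapping is left open. The function names and the similarity values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence

# One example holds two captions and two images. sim[i][j] is the model's
# matching score between caption i and image j; caption i describes image i.
Sim = Sequence[Sequence[float]]


def text_correct(sim: Sim) -> bool:
    """For each image, the matching caption must outscore the swapped caption."""
    return sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]


def image_correct(sim: Sim) -> bool:
    """For each caption, the matching image must outscore the swapped image."""
    return sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]


def group_correct(sim: Sim) -> bool:
    """An example counts only if both the text and image checks pass."""
    return text_correct(sim) and image_correct(sim)


def evaluate(sims: List[Sim]) -> Dict[str, float]:
    """Fraction of examples passing each check (assumes a non-empty list)."""
    n = len(sims)
    return {
        "text": sum(text_correct(s) for s in sims) / n,
        "image": sum(image_correct(s) for s in sims) / n,
        "group": sum(group_correct(s) for s in sims) / n,
    }


# Made-up scores for two examples (not results from the paper).
example_sims = [
    [[0.9, 0.2], [0.1, 0.8]],  # both captions and both images matched correctly
    [[0.6, 0.7], [0.5, 0.8]],  # captions ranked correctly, caption 0 prefers the wrong image
]
scores = evaluate(example_sims)
print(scores)
```

Under this scheme, a model that assigns scores at random passes the text or image check about 25% of the time and the group check about 17% of the time, which is the usual reference point for "chance" on Winoground-style benchmarks.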