Several benchmarks have concluded that our best vision-language models (e.g., CLIP) lack compositionality. Given an image, these benchmarks probe a model's ability to identify its associated caption among a set of compositional distractors. In response, a surge of recent proposals shows improvements from finetuning CLIP with distractors as hard negatives. Our investigations reveal that these improvements have, in fact, been significantly overstated, because existing benchmarks do not probe whether finetuned vision-language models remain invariant to hard positives. By curating an evaluation dataset with 112,382 hard negatives and hard positives, we uncover that including hard positives decreases CLIP's performance by 12.9%, while humans perform effortlessly at 99%. Finetuning CLIP with hard negatives leads to an even larger decrease, of up to 38.7%. Motivated by this finding, we produce a training set of 1,775,259 image-text pairs with both hard negative and hard positive captions. Training with both yields improvements on existing benchmarks while simultaneously improving performance on hard positives, indicating a more robust improvement in compositionality. Our work suggests the need for future research to rigorously test and improve CLIP's understanding of semantic relationships between related "positive" concepts.
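To make the evaluation idea concrete, the following is a minimal sketch of how adding hard positives can lower a measured score. It is an illustration under stated assumptions, not the paper's actual metric: the function name `compositional_accuracy`, the pass criteria, and the similarity values are all hypothetical. It assumes each image comes with precomputed image-text similarity scores for its original caption, one hard positive, and one hard negative.

```python
import numpy as np

def compositional_accuracy(sim_orig, sim_pos, sim_neg):
    """Compare a hard-negative-only score with one that also checks hard positives.

    All three arguments are 1-D sequences of image-text similarity scores,
    one entry per image (hypothetical inputs for illustration):
      sim_orig: similarity to the original caption
      sim_pos:  similarity to a hard positive (meaning-preserving rewrite)
      sim_neg:  similarity to a hard negative (meaning-changing distractor)
    """
    sim_orig = np.asarray(sim_orig, dtype=float)
    sim_pos = np.asarray(sim_pos, dtype=float)
    sim_neg = np.asarray(sim_neg, dtype=float)

    # Assumed "standard" criterion: the original caption beats the hard negative.
    standard = sim_orig > sim_neg
    # Assumed "augmented" criterion: the hard positive must also beat the hard negative.
    augmented = standard & (sim_pos > sim_neg)

    return {
        "standard": float(standard.mean()),
        "augmented": float(augmented.mean()),
    }

# Made-up similarity scores for four images.
scores = compositional_accuracy(
    sim_orig=[0.31, 0.28, 0.35, 0.30],
    sim_pos=[0.29, 0.20, 0.33, 0.27],
    sim_neg=[0.25, 0.24, 0.36, 0.22],
)
print(scores)  # {'standard': 0.75, 'augmented': 0.5}
```

In this toy case the second image passes the hard-negative test but fails once the hard positive is considered, which is the kind of gap a benchmark built only from hard negatives would not reveal.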