CREPE: Can Vision-Language Foundation Models Reason Compositionally?

A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that models spanning 7 architectures, trained with 4 algorithms on massive datasets, struggle with compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified in the cognitive science literature: systematicity and productivity. To measure systematicity, CREPE includes a test dataset of over $370K$ image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate $325K$, $316K$, and $309K$ hard negative captions for a subset of the pairs. To test productivity, CREPE contains $17K$ image-text pairs at nine levels of complexity, plus $183K$ hard negative captions with atomic, swapping, and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to $12\%$. For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size.
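To make the retrieval metric concrete, the following is a minimal sketch of how Recall@K and its random-chance baseline could be computed for a retrieval set made of one ground-truth caption and several hard negatives. The function names, the convention that the ground-truth caption sits at index 0, and the toy similarity scores are all illustrative assumptions; this is not the CREPE evaluation code, and the numbers are not results from the benchmark.

```python
from typing import Sequence


def recall_at_k(
    scores: Sequence[Sequence[float]],
    positive_index: int = 0,
    k: int = 1,
) -> float:
    """Fraction of queries whose ground-truth caption ranks in the top k.

    Each row holds the image-text similarity scores for one query image
    against its retrieval set. Assumption: the ground-truth caption is at
    `positive_index` and every other entry is a hard negative.
    """
    if not scores:
        raise ValueError("scores must contain at least one query")
    hits = 0
    for row in scores:
        positive = row[positive_index]
        # Rank = number of candidates scoring strictly higher than the
        # ground truth, so ties are resolved in the ground truth's favor.
        rank = sum(1 for s in row if s > positive)
        if rank < k:
            hits += 1
    return hits / len(scores)


def chance_recall_at_k(num_candidates: int, k: int = 1) -> float:
    """Expected Recall@K when candidates are ranked uniformly at random."""
    if num_candidates <= 0:
        raise ValueError("num_candidates must be positive")
    return min(k, num_candidates) / num_candidates


# Made-up scores: 4 query images, each with 1 ground truth + 3 hard negatives.
toy_scores = [
    [0.90, 0.20, 0.40, 0.10],  # ground truth ranked first
    [0.30, 0.50, 0.10, 0.20],  # one hard negative outranks the ground truth
    [0.70, 0.60, 0.65, 0.10],  # ground truth ranked first
    [0.20, 0.10, 0.25, 0.05],  # one hard negative outranks the ground truth
]

print("Recall@1:", recall_at_k(toy_scores, k=1))
print("Chance Recall@1:", chance_recall_at_k(len(toy_scores[0]), k=1))
```

Comparing the measured Recall@1 against `chance_recall_at_k` for the same retrieval-set size is what gives meaning to the statement that retrieval success nears random chance. Note that the tie-breaking rule here is optimistic; a stricter evaluation might count ties against the model.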
