Auto-Comp: An Automated Pipeline for Scalable Compositional Probing of Contrastive Vision-Language Models

Modern Vision-Language Models (VLMs) exhibit a critical flaw in compositional reasoning, often confusing "a red cube and a blue sphere" with "a blue cube and a red sphere". Disentangling the visual and linguistic roots of these failures is a fundamental challenge for robust evaluation. To enable fine-grained, controllable analysis, we introduce Auto-Comp, a fully automated, synthetic pipeline for generating scalable benchmarks. Its controllability is key to isolating and dissecting individual reasoning skills. Auto-Comp generates paired images from Minimal captions (e.g., "a monitor to the left of a bicycle on a white background") and LLM-generated Contextual captions (e.g., "In a brightly lit photography studio, a monitor is positioned to the left of a bicycle"), allowing a controlled A/B test that disentangles core binding ability from visio-linguistic complexity. Our evaluation of 20 VLMs on novel benchmarks for color binding and spatial relations reveals universal compositional failures in both the CLIP and SigLIP model families. Crucially, our novel "Confusion Benchmark" reveals a deeper flaw beyond simple attribute swaps: models are highly susceptible to low-entropy distractors (e.g., repeated objects or colors), demonstrating that their compositional failures extend beyond known bag-of-words limitations. We also uncover a surprising trade-off: visio-linguistic context, which provides global scene cues, aids spatial reasoning but simultaneously hinders local attribute binding by introducing visual clutter. We release the Auto-Comp pipeline to facilitate future benchmark creation, alongside all our generated benchmarks (https://huggingface.co/AutoComp).
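To make the caption types above concrete, the following is a minimal sketch of how a Minimal color-binding caption, its attribute-swapped negative, and low-entropy distractors might be constructed. It is illustrative only: the `ColoredObject` class, the function names, and the "X and Y" template are hypothetical and are not taken from the Auto-Comp pipeline, and the reading of "repeated objects or colors" used here is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColoredObject:
    """Hypothetical container for one color-object pair (not from Auto-Comp)."""
    color: str
    noun: str

    def phrase(self) -> str:
        # Pick the indefinite article from the first letter of the color word.
        article = "an" if self.color[0].lower() in "aeiou" else "a"
        return f"{article} {self.color} {self.noun}"


def minimal_caption(a: ColoredObject, b: ColoredObject) -> str:
    # Assumed template: two colored objects joined by "and".
    return f"{a.phrase()} and {b.phrase()}"


def swap_negative(a: ColoredObject, b: ColoredObject) -> str:
    # Classic attribute swap: the two objects exchange colors.
    return minimal_caption(ColoredObject(b.color, a.noun),
                           ColoredObject(a.color, b.noun))


def low_entropy_distractors(a: ColoredObject, b: ColoredObject) -> list[str]:
    # Assumed interpretation of "repeated objects or colors":
    # one caption repeats the first noun, the other repeats the first color.
    repeated_object = minimal_caption(a, ColoredObject(b.color, a.noun))
    repeated_color = minimal_caption(a, ColoredObject(a.color, b.noun))
    return [repeated_object, repeated_color]


if __name__ == "__main__":
    cube = ColoredObject("red", "cube")
    sphere = ColoredObject("blue", "sphere")
    print("positive:   ", minimal_caption(cube, sphere))
    print("swap:       ", swap_negative(cube, sphere))
    for caption in low_entropy_distractors(cube, sphere):
        print("distractor: ", caption)
```

Under this sketch, a contrastive VLM would be scored on whether the image generated from the positive caption is more similar to that caption than to the swapped or distractor captions.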
