Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five human captions each, we automatically generate paraphrases and rule-based flips that alter object category, color or count, and summarize model behavior with an invariance error, a semantic sensitivity gap and a positive-rate statistic. Across nine VLMs, EVA02-CLIP and large OpenCLIP variants lie on a favorable invariance-sensitivity frontier, combining low paraphrase-induced variance with consistently higher scores for original captions than for their flipped counterparts. In contrast, SigLIP and SigLIP2 show much larger invariance error and often prefer flipped captions to the human descriptions, especially for object and color edits. These failures are largely invisible to standard retrieval metrics, indicating that LGIP provides a model-agnostic diagnostic for the linguistic robustness of VLMs beyond conventional accuracy scores.
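The three summary statistics can be sketched from per-caption matching scores. The abstract does not give exact definitions, so the formulas below (standard deviation over paraphrase scores for invariance error, mean margin of original over flipped captions for the sensitivity gap, and the fraction of flips the original outscores for the positive rate) and all function names and toy scores are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def invariance_error(paraphrase_scores):
    # Assumed definition: spread of matching scores across
    # meaning-preserving paraphrases of one caption (lower is better).
    return float(np.std(paraphrase_scores))

def sensitivity_gap(original_score, flipped_scores):
    # Assumed definition: mean margin by which the original caption
    # outscores its meaning-changing flips (higher is better).
    return float(original_score - np.mean(flipped_scores))

def positive_rate(original_score, flipped_scores):
    # Fraction of flips for which the model still prefers the original.
    return float(np.mean(original_score > np.asarray(flipped_scores)))

# Toy similarity scores for one image (stand-ins for VLM outputs).
paraphrases = np.array([0.31, 0.30, 0.32, 0.29])  # meaning-preserving edits
flips = np.array([0.18, 0.22, 0.25])              # object/color/count edits
original = 0.31

print(invariance_error(paraphrases))       # small spread = robust
print(sensitivity_gap(original, flips))    # positive margin = sensitive
print(positive_rate(original, flips))      # 1.0 = never prefers a flip
```

Under these toy scores a robust model shows low paraphrase variance and a positive gap; the SigLIP failure mode described above would appear as a positive rate below 1.0, i.e. some flipped captions outscoring the human one.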