Vision-language models (VLMs) can answer image-based questions confidently, and often correctly, even when no image is provided. This mirage behavior inflates benchmark scores without reflecting visual grounding. Prior work treats this as a single failure mode. We argue it is two. Using Mirage Probes, a contrastive probing framework that pairs paraphrased question variants with matched mirage and non-mirage labels on the same image, we show that mirage behavior is linearly decodable from internal activations across residual stream, MLP, post-attention, and attention-head sites in two open-source VLMs. We demonstrate that a Naive Bayes text baseline cannot recover this signal, ruling out surface lexical confounds. Cross-benchmark separability patterns, together with a novel Prior Harnessing Index (PHI) measuring how much a model can answer from text alone, expose two distinct regimes: textual biases, where the model answers from language priors without engaging visual representations, and spurious images, where it constructs false visual content in latent space and answers as if grounded. The distinction has direct mitigation consequences: text-distribution cleaning can address the first regime but cannot reach the second, since spurious-image mirages live in the model's visual representations rather than its text. Faithful visual grounding will require interventions at the representational level.
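To make "linearly decodable from internal activations" concrete, the following is a minimal sketch of a linear probe, not the Mirage Probes implementation itself. It assumes synthetic activations in place of real VLM activations: the data generator, the dimensionality, the class shift, and all function names (`make_synthetic_activations`, `fit_linear_probe`, `probe_accuracy`) are hypothetical and chosen only for illustration. In practice the input matrix would hold activations extracted at one site (for example, the residual stream at a given layer), with one row per question variant and a binary mirage / non-mirage label.

```python
import numpy as np


def make_synthetic_activations(n=400, d=32, shift=2.0, seed=0):
    """Hypothetical stand-in for per-example activations at one model site.

    Labels: 1 = mirage, 0 = non-mirage. The two classes differ by a shift
    along a single random unit direction, which is an assumption of this toy.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    signs = 2 * labels - 1  # map {0, 1} to {-1, +1}
    acts = rng.normal(size=(n, d)) + shift * np.outer(signs, direction)
    return acts, labels


def fit_linear_probe(x, y, lr=0.1, steps=500, l2=1e-3):
    """Logistic-regression probe trained with full-batch gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(mirage)
        w -= lr * (x.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b


def probe_accuracy(w, b, x, y):
    """Fraction of examples whose predicted label matches the true label."""
    preds = ((x @ w + b) > 0).astype(int)
    return float(np.mean(preds == y))


acts, labels = make_synthetic_activations()
split = 300  # first 300 examples for training, the rest held out
w, b = fit_linear_probe(acts[:split], labels[:split])
test_acc = probe_accuracy(w, b, acts[split:], labels[split:])
print(f"held-out probe accuracy: {test_acc:.2f}")
```

High held-out accuracy from a probe like this indicates only that the label is linearly separable at that site. The comparison against a text-only baseline, as described in the abstract, is what rules out the possibility that the probe is picking up surface wording rather than internal state.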