Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, these PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that ablating PIH-heads increases correction toward visual evidence. Our findings offer insight into the internal mechanisms driving prompt-induced hallucinations and into how their implementation differs across models.