Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, so intervention is costly and comes too late. We investigate whether hallucination risk can instead be predicted before any token is generated, by probing a model's internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ~0.79 AUROC for Qwen2.5-VL-7B using visual-only features). These results demonstrate that (1) hallucination risk is detectable before generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes could enable early abstention, selective routing, and adaptive decoding, improving both safety and efficiency.
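To make the probing setup concrete, the following is a minimal sketch of a linear probe evaluated by AUROC, followed by a simple abstention rule. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the probe architecture, so logistic regression is assumed here; the feature vectors are synthetic stand-ins for hidden states (no VLM is run); and every name (`train_linear_probe`, `probe_score`, `make_synthetic_states`, `should_abstain`, the 0.5 threshold) is hypothetical.

```python
import math
import random


def auroc(scores, labels):
    """AUROC: probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w, b, x):
    """Predicted hallucination risk in [0, 1] for one feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_linear_probe(features, labels, lr=0.1, epochs=200):
    """Fit logistic regression by full-batch gradient descent; returns (weights, bias)."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = probe_score(w, b, x) - y  # gradient of log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def make_synthetic_states(n, dim, rng):
    """ASSUMPTION: synthetic stand-in for pre-generation hidden states.

    Label 1 ("will hallucinate") shifts the first coordinate by 2.0; in a real
    setup these vectors would be extracted from the model in one forward pass.
    """
    features, labels = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 2.0 * label
        features.append(x)
        labels.append(label)
    return features, labels


def should_abstain(score, threshold=0.5):
    """Hypothetical early-abstention rule: skip decoding when predicted risk is high."""
    return score >= threshold


rng = random.Random(0)
train_X, train_y = make_synthetic_states(400, 8, rng)
test_X, test_y = make_synthetic_states(200, 8, rng)

w, b = train_linear_probe(train_X, train_y)
test_scores = [probe_score(w, b, x) for x in test_X]
test_auroc = auroc(test_scores, test_y)
print(f"held-out AUROC on synthetic data: {test_auroc:.3f}")
```

The AUROC printed here reflects only the synthetic separation built into `make_synthetic_states` and says nothing about the reported results. In practice one would train one such probe per layer and per representation family, then compare held-out AUROC to find the most informative layer for each model.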