Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models

Vision-Language Models (VLMs) frequently "hallucinate", generating plausible yet factually incorrect statements, which poses a critical barrier to their trustworthy deployment. In this work, we propose a new paradigm for diagnosing hallucinations, recasting them from static output errors into dynamic pathologies of a model's computational cognition. Our framework is grounded in a normative principle of computational rationality, allowing us to model a VLM's generation as a dynamic cognitive trajectory. We design a suite of information-theoretic probes that project this trajectory onto an interpretable, low-dimensional Cognitive State Space. Our central discovery is a governing principle we term the geometric-information duality: a cognitive trajectory's geometric abnormality within this space is fundamentally equivalent to its high information-theoretic surprisal. Hallucination detection thereby becomes a geometric anomaly detection problem. Evaluated across diverse settings, from rigorous binary QA (POPE) and comprehensive reasoning (MME) to unconstrained open-ended captioning (MS-COCO), our framework achieves state-of-the-art performance. Crucially, it operates with high efficiency under weak supervision and remains highly robust even when calibration data is heavily contaminated. This approach enables a causal attribution of failures, mapping observable errors to distinct pathological states: perceptual instability (measured by Perceptual Entropy), logical-causal failure (measured by Inferential Conflict), and decisional ambiguity (measured by Decision Entropy). Ultimately, this opens a path toward building AI systems whose reasoning is transparent, auditable, and diagnosable by design.
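
As a rough illustration of why geometric abnormality and surprisal can coincide, the sketch below assumes (purely for illustration; the abstract does not specify the density model) that calibration points in a three-dimensional state space follow a Gaussian. Under that assumption, the surprisal of a point equals half its squared Mahalanobis distance plus a constant, so ranking points by distance is the same as ranking them by surprisal. All function names, the synthetic calibration data, and the three axes are hypothetical stand-ins, not the paper's probes.

```python
import numpy as np

def shannon_entropy(probs):
    """Entropy in nats of a discrete distribution (e.g. over answer tokens)."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()
    nz = p[p > 0]  # drop zero-probability entries, since 0 * log 0 := 0
    return float(-(nz * np.log(nz)).sum())

def fit_gaussian(states):
    """Fit a mean and covariance to calibration points (rows are samples)."""
    x = np.asarray(states, dtype=float)
    mean = x.mean(axis=0)
    # A small ridge keeps the covariance invertible (an illustrative choice).
    cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
    return mean, cov

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance: the geometric abnormality score."""
    diff = np.asarray(point, dtype=float) - mean
    return float(diff @ np.linalg.solve(cov, diff))

def gaussian_surprisal(point, mean, cov):
    """Negative log-density under the fitted Gaussian, in nats."""
    k = mean.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (mahalanobis_sq(point, mean, cov) + logdet + k * np.log(2 * np.pi))

# Synthetic calibration set; the three axes are hypothetical stand-ins for
# (perceptual entropy, inferential conflict, decision entropy).
rng = np.random.default_rng(0)
calibration = rng.normal(loc=[0.5, 0.2, 0.3], scale=0.1, size=(200, 3))
mean, cov = fit_gaussian(calibration)

typical = np.array([0.5, 0.2, 0.3])
outlier = np.array([1.5, 1.0, 1.2])
for name, pt in [("typical", typical), ("outlier", outlier)]:
    print(name, mahalanobis_sq(pt, mean, cov), gaussian_surprisal(pt, mean, cov))
```

In this toy setting, a detector could flag any point whose distance exceeds a threshold chosen on calibration data. A non-Gaussian density would break the exact equivalence shown here, so the sketch should be read only as one simple case in which the duality holds.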
