Hallucination in large language models (LLMs) can be understood as a failure of faithful readout: although internal representations may encode uncertainty about a query, decoding pressures still yield a fluent answer. We propose a lightweight residual probe that reads hallucination risk directly from intermediate hidden states of question tokens, motivated by the hypothesis that these layers retain epistemic signals that are attenuated in the final decoding stage. The probe is a small auxiliary network whose computation is orders of magnitude cheaper than token generation and can be evaluated fully in parallel with inference, enabling near-instantaneous hallucination risk estimation with effectively zero added latency for low-risk queries. We deploy the probe as an agentic critic for fast selective generation and routing, allowing LLMs to answer confident queries immediately while delegating uncertain ones to stronger verification pipelines. Across four QA benchmarks and multiple LLM families, the method achieves strong AUROC and AURAC, generalizes under dataset shift, and reveals interpretable structure in intermediate representations, positioning fast internal uncertainty readout as a principled foundation for reliable agentic AI.
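As a concrete illustration, the following is a minimal PyTorch sketch of such a probe and its routing use, assuming a HuggingFace-style causal LM that exposes hidden states. The probe architecture, pooling scheme, layer index, and routing threshold shown here are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a residual probe for hallucination-risk readout.
# Assumptions (illustrative, not the paper's exact setup): a HuggingFace-style
# causal LM called with output_hidden_states=True, a mid-depth layer index,
# mean pooling over question tokens, and a 2-layer MLP probe trained
# separately on binary labels (hallucinated vs. faithful).

import torch
import torch.nn as nn


class ResidualProbe(nn.Module):
    """Small auxiliary network reading risk from one intermediate layer."""

    def __init__(self, hidden_dim: int, probe_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, probe_dim),
            nn.ReLU(),
            nn.Linear(probe_dim, 1),
        )

    def forward(self, pooled_state: torch.Tensor) -> torch.Tensor:
        # Returns a hallucination-risk probability in [0, 1].
        return torch.sigmoid(self.net(pooled_state)).squeeze(-1)


@torch.no_grad()
def risk_from_question(model, tokenizer, question: str,
                       probe: ResidualProbe, layer: int = 16) -> float:
    """One forward pass over the question tokens; risk is read from an
    intermediate hidden layer, so no token generation is required."""
    inputs = tokenizer(question, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)
    # outputs.hidden_states[layer] has shape (batch, seq_len, hidden_dim).
    pooled = outputs.hidden_states[layer].mean(dim=1)  # pool over question tokens
    return probe(pooled).item()


def route(risk: float, threshold: float = 0.5) -> str:
    """Agentic-critic routing: answer directly when confident,
    otherwise delegate to a stronger verification pipeline."""
    return "answer_directly" if risk < threshold else "delegate_to_verifier"
```

Because the probe consumes hidden states that the model computes anyway while encoding the question, its forward pass can run concurrently with decoding; the routing decision then costs nothing beyond the probe's own small matrix multiplications.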