LLMs often produce fluent but incorrect answers, yet detecting such hallucinations typically requires multiple sampling passes or post-hoc verification, adding significant latency and cost. We hypothesize that intermediate layers encode confidence signals that are lost in the final output layer, and propose a lightweight probe that reads these signals directly from hidden states. The probe adds less than 0.1\% computational overhead and runs fully in parallel with generation, enabling hallucination detection before the answer is produced. Building on this, we develop an LLM router that answers confident queries immediately while delegating uncertain ones to stronger models. Despite its simplicity, our method achieves state-of-the-art AUROC in 10 of 12 settings across four QA benchmarks and three LLM families, with gains of up to 13 points over prior methods, and generalizes across dataset shifts without retraining.
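The probe-and-route idea above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes synthetic hidden-state vectors and correctness labels, trains a linear (logistic-regression) probe by gradient descent as a stand-in for the lightweight probe, thresholds the probe's confidence to route queries, and scores the probe with AUROC. All names (`THRESHOLD`, `auroc`, the data dimensions) are illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: H holds hidden-state vectors from an intermediate
# layer (one row per query); y = 1 if the model's eventual answer is correct.
d, n = 32, 2000
w_true = rng.normal(size=d)                      # hidden "confidence" direction
H = rng.normal(size=(n, d))
y = (H @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Lightweight linear probe: logistic regression fit by full-batch gradient descent.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = sigmoid(H @ w + b)
    w -= lr * (H.T @ (p - y) / n)
    b -= lr * np.mean(p - y)

# Probe confidence that each answer will be correct; computed from hidden
# states alone, so in principle available before decoding finishes.
confidence = sigmoid(H @ w + b)

# Router: answer locally when confident, delegate to a stronger model otherwise.
THRESHOLD = 0.5                                  # tunable operating point
route_to_strong_model = confidence < THRESHOLD

# Rank-based AUROC to evaluate how well confidence separates correct/incorrect.
def auroc(scores, labels):
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(f"probe AUROC on synthetic data: {auroc(confidence, y):.3f}")
```

Because the probe is a single inner product per query, its cost is negligible next to a forward pass, which is consistent with the sub-0.1\% overhead claim; the routing threshold trades local answer rate against delegation cost.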