Large language models can follow complex procedures yet fail at a seemingly trivial final step: reporting a value they themselves computed moments earlier. We study this phenomenon as \emph{procedural hallucination}: failure to execute a verifiable, prompt-grounded specification even when the correct value is present in context. In long-context binding tasks with a known single-token candidate set, we find that many errors are readout-stage routing failures. Specifically, failures decompose into Stage~2A (gating) errors, where the model does not enter answer mode, and Stage~2B (binding) errors, where it enters answer mode but selects the wrong candidate (often due to recency bias). In the hard regime, Stage~2B accounts for most errors across model families in our tasks (Table~1). On Stage~2B error trials, a linear probe on the final-layer residual stream recovers the correct value far above chance (e.g., 74\% vs.\ 2\% on Qwen2.5-3B; Table~2), indicating that the answer is encoded but not used. We formalize ``present but not used'' via available vs.\ used mutual information and pseudo-prior interventions, yielding output-computable diagnostics and information-budget certificates. Finally, an oracle checkpointing intervention that restates the true binding near the query can nearly eliminate Stage~2B failures at long distance (e.g., Qwen2.5-3B $0/400 \rightarrow 399/400$ at $k = 1024$; Table~8).
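The ``present but not used'' criterion can be made concrete with a short sketch (the notation here is illustrative and may differ from the paper's): let $h$ denote the final-layer residual state at the answer position, $\hat{y}$ the emitted token, and $y^\star$ the correct candidate. Define
\[
I_{\mathrm{avail}} \;=\; I\!\left(h;\, y^\star\right), \qquad
I_{\mathrm{used}} \;=\; I\!\left(\hat{y};\, y^\star\right).
\]
Since $\hat{y}$ is produced from $h$, the data-processing inequality gives $I_{\mathrm{used}} \le I_{\mathrm{avail}}$; a Stage~2B trial is flagged as present-but-not-used when $I_{\mathrm{avail}}$ is high (e.g., a linear probe decodes $y^\star$ from $h$) while $I_{\mathrm{used}}$ is near zero ($\hat{y}$ is wrong), with the gap lower-bounding the information discarded at readout.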