We reinterpret the final Large Language Model (LLM) softmax classifier as an Energy-Based Model (EBM), decomposing the sequence-to-sequence probability chain into multiple interacting EBMs at inference. This principled approach allows us to track "energy spills" during decoding, which we empirically show correlate with factual errors, biases, and failures. Similar to Orgad et al. (2025), our method localizes the exact answer token and subsequently tests for hallucinations. Crucially, however, we achieve this without requiring trained probe classifiers or activation ablations. Instead, we introduce two completely training-free metrics derived directly from output logits: spilled energy, which captures the discrepancy between energy values across consecutive generation steps that should theoretically match, and marginalized energy, which is measurable at a single step. Evaluated on nine benchmarks across state-of-the-art LLMs (including LLaMA, Mistral, and Gemma) and on synthetic algebraic operations (Qwen3), our approach demonstrates robust, competitive hallucination detection and cross-task generalization. Notably, these results hold for both pretrained and instruction-tuned variants without introducing any training overhead.
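The logit-derived quantities above can be sketched in a few lines. Under the standard softmax-as-EBM reading, the free energy of a logit vector is the negative log-sum-exp of the logits; that part is standard. The consecutive-step discrepancy function below is only a hypothetical illustration of a "spilled energy"-style signal — the paper's exact definition is not reproduced here — and both function names are placeholders, not the authors' API.

```python
import math

def free_energy(logits):
    """Softmax-as-EBM free energy: E = -log(sum_y exp(logit_y)).
    Uses the max-shift trick for numerical stability."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))

def step_energy_discrepancy(logits_t, logits_t_plus_1):
    """Hypothetical illustration of a consecutive-step energy gap:
    the absolute difference between free energies at steps t and t+1.
    (Placeholder only; not the paper's actual spilled-energy formula.)"""
    return abs(free_energy(logits_t) - free_energy(logits_t_plus_1))

# Uniform logits over a 2-token vocabulary: E = -log(2)
print(round(free_energy([0.0, 0.0]), 4))  # -0.6931
```

Because both metrics depend only on output logits, such signals can be read off at decoding time without probe classifiers, activation access, or any training, which is the training-free property the abstract emphasizes.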