Detecting hallucinations in large language models is a critical open problem with significant implications for safety and reliability. Existing hallucination detection methods perform well on question-answering tasks but remain less effective on tasks that require reasoning. In this work, we revisit hallucination detection through the lens of out-of-distribution (OOD) detection, a well-studied problem in areas such as computer vision. Treating next-token prediction in language models as a classification task allows us to apply OOD techniques, provided they are modified to account for the structural differences in large language models. We show that OOD-based approaches yield training-free, single-sample detectors that achieve strong accuracy in hallucination detection for reasoning tasks. Overall, our work suggests that reframing hallucination detection as OOD detection provides a promising and scalable pathway toward language model safety.
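To make the classification view concrete, the following is a minimal sketch of how two standard OOD baselines from the vision literature, maximum softmax probability (MSP) and the energy score, could be computed from next-token logits. It is an illustration under stated assumptions, not the detectors proposed in this work: the abstract does not specify the modifications needed for large language models, and the helper names (`msp_score`, `energy_score`, `sequence_score`) and the mean aggregation over tokens are hypothetical choices made here. The input is assumed to be an array of shape `(sequence_length, vocabulary_size)` holding the logits of a single generated sample.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def msp_score(logits):
    """Per-token maximum softmax probability (lower suggests more OOD-like)."""
    return np.exp(log_softmax(logits).max(axis=-1))


def energy_score(logits, temperature=1.0):
    """Per-token energy, -T * logsumexp(logits / T) (higher suggests more OOD-like)."""
    z = logits / temperature
    m = z.max(axis=-1)
    return -temperature * (m + np.log(np.exp(z - m[..., None]).sum(axis=-1)))


def sequence_score(logits, aggregate=np.mean):
    """Hypothetical sequence-level score: aggregate (1 - MSP) over all tokens.

    The mean is an assumption made for illustration; other aggregations
    (max, min, last token) are equally plausible.
    """
    return float(aggregate(1.0 - msp_score(logits)))


# Toy logits for a 2-token generation over a 4-word vocabulary (made-up values).
confident = np.array([[10.0, 0.0, 0.0, 0.0],
                      [0.0, 12.0, 0.0, 0.0]])
uncertain = np.zeros((2, 4))  # uniform next-token distribution

print(sequence_score(confident), sequence_score(uncertain))
print(energy_score(confident), energy_score(uncertain))
```

Both scores need only one forward pass over one sample and no additional training, which matches the training-free, single-sample setting described above. A threshold on the score would still have to be chosen on held-out data.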