Large Language Models (LLMs) have become powerful, but hallucinations remain a critical obstacle to their trustworthy use. Prior work has improved hallucination detection by measuring uncertainty, but it cannot explain the provenance of hallucinations, particularly which parts of the input tend to trigger them. Recent work on prompt attacks indicates that uncertainty arises in semantic propagation, where attention mechanisms gradually fuse local token information into high-level semantics across layers. Uncertainty also emerges in language generation, owing to its probability-based selection of high-level semantics for sampled generations. Based on these observations, we propose RePPL, which recalibrates uncertainty measurement along both aspects: it dispatches an explainable uncertainty score to each token and aggregates them into a total score in a perplexity-style log-average form. Experiments show that RePPL achieves the best overall detection performance across various QA datasets on advanced models (average AUC of 0.833) and can produce token-level uncertainty scores as explanations of hallucination.
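The perplexity-style log-average aggregation mentioned above can be sketched as follows. This is a minimal illustration, not the paper's actual method: it assumes each token already carries a positive uncertainty score and that "log-average" means exponentiating the mean log score (a geometric mean), mirroring how perplexity aggregates per-token log-probabilities; the function name `reppl_score` is hypothetical.

```python
import math

def reppl_score(token_scores):
    """Aggregate per-token uncertainty scores into one total score.

    Hypothetical sketch of a perplexity-style log-average: take the
    mean of the log scores and exponentiate, i.e. the geometric mean.
    Assumes all scores are strictly positive.
    """
    n = len(token_scores)
    return math.exp(sum(math.log(s) for s in token_scores) / n)
```

Under this reading, a single highly uncertain token raises the total score smoothly rather than dominating it, and the per-token scores remain available as token-level explanations.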