We propose a lightweight, single-pass uncertainty quantification method for detecting hallucinations in Large Language Models. The method estimates uncertainty from attention matrices, without repeated sampling or external models. Specifically, we measure the Kullback-Leibler divergence between each attention head's distribution and a uniform reference distribution, and use these per-head divergences as features in a logistic regression probe. Across multiple datasets, task types, and model families, attention divergence is highly predictive of answer correctness and performs competitively with existing uncertainty estimation methods. We find that this signal is concentrated in middle layers and on factual tokens such as named entities and numbers, suggesting that attention dynamics provide an efficient and interpretable white-box signal of model uncertainty.
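As an illustration of the feature described above, the following is a minimal sketch, not the implementation used in this work. It assumes the divergence is taken in the direction KL(p || u), where p is one head's attention distribution over n keys and u is uniform, so that KL(p || u) = log n - H(p). The function names (`head_kl_to_uniform`, `fit_logistic_probe`), the array layout, and the plain gradient-descent probe with its hyperparameters are hypothetical choices made for the example. How divergences are aggregated over answer tokens is not specified here and is left out.

```python
import numpy as np


def head_kl_to_uniform(attn, eps=1e-12):
    """KL(p || u) for each attention distribution along the last axis.

    attn: array of shape (..., n_keys); each slice along the last axis is
    assumed to be one head's attention weights and to sum to 1.
    Returns an array of shape (...) with one divergence per head.
    """
    attn = np.asarray(attn, dtype=np.float64)
    n = attn.shape[-1]
    # Clip only inside the log so zero weights contribute exactly 0.
    log_p = np.log(np.clip(attn, eps, 1.0))
    # KL(p || u) = sum_i p_i * (log p_i - log(1/n))
    return np.sum(attn * (log_p + np.log(n)), axis=-1)


def fit_logistic_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic regression probe with batch gradient descent.

    X: (n_examples, n_features) divergence features, y: (n_examples,) 0/1
    correctness labels. Returns the weight vector and bias.
    """
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b


def predict_proba(X, w, b):
    """Probability assigned by the probe to the positive class."""
    X = np.asarray(X, dtype=np.float64)
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


# Sanity check on hand-built distributions: a uniform head has zero
# divergence, and a one-hot head reaches the maximum, log(n_keys).
uniform_head = np.full(4, 0.25)
one_hot_head = np.array([1.0, 0.0, 0.0, 0.0])
kl_uniform = head_kl_to_uniform(uniform_head)
kl_one_hot = head_kl_to_uniform(one_hot_head)
```

For attention weights stored as an array of shape (layers, heads, keys) at a single query position, `head_kl_to_uniform(attn).reshape(-1)` would give one feature per head, and stacking these vectors over examples yields the matrix `X` passed to the probe.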