Knowledge hallucination has raised widespread concerns about the security and reliability of deployed LLMs. Previous efforts in detecting hallucinations have focused on logit-level uncertainty estimation or language-level self-consistency evaluation, where semantic information is inevitably lost during the token-decoding procedure. Thus, we propose to explore the dense semantic information retained within LLMs' \textbf{IN}ternal \textbf{S}tates for halluc\textbf{I}nation \textbf{DE}tection (\textbf{INSIDE}). In particular, a simple yet effective \textbf{EigenScore} metric is proposed to better evaluate responses' self-consistency, which exploits the eigenvalues of the responses' covariance matrix to measure semantic consistency/diversity in the dense embedding space. Furthermore, from the perspective of self-consistent hallucination detection, a test-time feature clipping approach is explored to truncate extreme activations in the internal states, which reduces overconfident generations and potentially benefits the detection of overconfident hallucinations. Extensive experiments and ablation studies on several popular LLMs and question-answering (QA) benchmarks demonstrate the effectiveness of our proposal.
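The two ideas above can be sketched concretely. A minimal, illustrative implementation (not the paper's exact formulation): an eigenvalue-based consistency score over the covariance of K sampled response embeddings, and a quantile-based feature clipping step. The function names, the regularizer \texttt{alpha}, and the quantile \texttt{p} are illustrative assumptions.

```python
import numpy as np

def eigenscore(embeddings, alpha=1e-3):
    """Illustrative EigenScore-style metric: mean log-eigenvalue of the
    regularized covariance of K response embeddings (shape K x d).
    A higher score means more semantic diversity across responses,
    which may indicate hallucination; `alpha` is an assumed regularizer."""
    K, _ = embeddings.shape
    # Center embeddings across the K sampled responses
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # K x K centered Gram matrix shares nonzero eigenvalues with the
    # d x d covariance and is cheaper to decompose when d >> K
    gram = centered @ centered.T
    # Regularize so the log of every eigenvalue is finite
    eigvals = np.linalg.eigvalsh(gram + alpha * np.eye(K))
    return float(np.mean(np.log(eigvals)))

def clip_features(hidden, p=0.2):
    """Illustrative test-time feature clipping: truncate extreme
    activations to the [p, 1-p] quantile range of the hidden state."""
    lo, hi = np.quantile(hidden, [p, 1.0 - p])
    return np.clip(hidden, lo, hi)
```

In this sketch, near-identical response embeddings yield a low (very negative) score, while unrelated embeddings yield a higher one, matching the intuition that inconsistent answers signal hallucination.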