While large language models (LLMs) have taken great strides toward helping humans with a plethora of tasks, hallucinations remain a major impediment to gaining user trust. Because model generations remain fluent and coherent even when hallucinating, detection is difficult. In this work, we explore whether the artifacts associated with model generations can provide hints that a generation will contain hallucinations. Specifically, we probe LLMs for signs of hallucination on open-ended question answering tasks at 1) the inputs, via Integrated Gradients-based token attribution; 2) the outputs, via the softmax probabilities; and 3) the internal state, via self-attention and fully connected layer activations. Our results show that the distributions of these artifacts tend to differ between hallucinated and non-hallucinated generations. Building on this insight, we train binary classifiers that use these artifacts as input features to classify model generations as hallucinated or non-hallucinated. These hallucination classifiers achieve up to $0.80$ AUROC. We also show that the tokens preceding a hallucination can predict the subsequent hallucination even before it occurs.
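The classification setup described above can be illustrated with a minimal sketch. The code below is not the paper's implementation: it uses synthetic per-token softmax probabilities as a stand-in for real model artifacts (hallucinated generations are simulated with lower, noisier token probabilities), summarizes each generation with simple statistics, and trains a binary logistic-regression hallucination classifier evaluated by AUROC.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_features(n, hallucinated):
    # Each "generation" is summarized by statistics over its per-token
    # softmax probabilities (mean, min, variance). The probabilities are
    # synthetic: hallucinated generations are simulated as lower and
    # noisier, a hypothetical stand-in for real model artifacts.
    base = 0.55 if hallucinated else 0.80
    probs = rng.beta(base * 10.0, (1.0 - base) * 10.0, size=(n, 32))
    return np.stack([probs.mean(1), probs.min(1), probs.var(1)], axis=1)

# Build a balanced dataset of 500 hallucinated and 500 faithful generations.
X = np.vstack([make_features(500, False), make_features(500, True)])
y = np.concatenate([np.zeros(500), np.ones(500)])  # 1 = hallucination

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = LogisticRegression().fit(X_tr, y_tr)
auroc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"AUROC: {auroc:.2f}")
```

In practice the features would come from the probed model itself (Integrated Gradients attributions, softmax probabilities, or internal activations), and any classifier operating on those features could replace the logistic regression used here.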