State-of-the-art language models (LMs) sometimes generate non-factual hallucinations that conflict with world knowledge. Despite extensive efforts to detect and mitigate hallucinations, their internal mechanisms remain poorly understood. Our study investigates the mechanistic causes of hallucination, specifically non-factual hallucinations in which the LM predicts an incorrect object attribute in response to a subject-relation query. Using causal mediation analysis and embedding space projection, we identify two general mechanistic causes of hallucination shared across LMs of various scales and designs: 1) insufficient subject attribute knowledge in lower-layer MLPs, and 2) failure to select the correct object attribute in upper-layer attention heads and MLPs. The two mechanisms differ in their degree of subject-object association, predictive uncertainty, and perturbation robustness. Additionally, we examine LM pre-training checkpoints, revealing distinct learning dynamics for the two mechanistic causes. We also show that attribution features from our causal analysis can be used to construct effective hallucination detectors. Our work offers a mechanistic understanding of LM factual errors.