Hallucinations in large language models are a widespread problem, yet the mechanisms that determine whether a model will hallucinate are poorly understood, limiting our ability to solve this problem. Using sparse autoencoders as an interpretability tool, we discover that a key part of these mechanisms is entity recognition, where the model detects whether an entity is one it can recall facts about. Sparse autoencoders uncover meaningful directions in the representation space that detect whether the model recognizes an entity, e.g., detecting that it does not know about an athlete or a movie. This suggests that models can have self-knowledge: internal representations about their own capabilities. These directions are causally relevant: they can steer the model to refuse to answer questions about known entities, or to hallucinate attributes of unknown entities when it would otherwise refuse. We demonstrate that although the sparse autoencoders are trained on the base model, these directions have a causal effect on the chat model's refusal behavior, suggesting that chat finetuning has repurposed this existing mechanism. Furthermore, we provide an initial exploration of the mechanistic role of these directions, finding that they disrupt the attention of downstream heads that typically move entity attributes to the final token.
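To make the steering intervention concrete, the sketch below shows one common way such a direction-based intervention can be implemented: adding a scaled, unit-normalized direction to a layer's output via a PyTorch forward hook. This is a minimal illustration, not the paper's code; the dimensionality, the layer (a toy linear block stands in for a transformer block's residual stream), the direction (random rather than an SAE decoder column), and the steering scale are all hypothetical.

```python
import torch
import torch.nn as nn

d_model = 16  # hypothetical residual-stream width for illustration


def make_steering_hook(direction: torch.Tensor, scale: float):
    """Return a forward hook that adds `scale * direction` to a module's output."""
    direction = direction / direction.norm()  # unit-normalize the direction

    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the module's output.
        return output + scale * direction.to(output.dtype)

    return hook


# Toy stand-in for one transformer block whose output is the residual stream.
block = nn.Linear(d_model, d_model)

# In practice this would be an SAE-derived "unknown entity" direction;
# here it is random purely for illustration.
unknown_entity_direction = torch.randn(d_model)

handle = block.register_forward_hook(
    make_steering_hook(unknown_entity_direction, scale=8.0)
)

x = torch.randn(1, 4, d_model)  # (batch, seq, d_model)
steered = block(x)              # activations shifted along the direction
handle.remove()                 # detach the hook when done
```

In the setting the abstract describes, scaling such a direction up or down at the relevant layer is what produces the reported behavioral changes, e.g., inducing refusals for known entities or attribute hallucinations for unknown ones.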