Hallucinations in large language models are a widespread problem, yet the mechanisms that determine whether a model will hallucinate are poorly understood, limiting our ability to solve this problem. Using sparse autoencoders as an interpretability tool, we discover that a key part of these mechanisms is entity recognition, where the model detects whether an entity is one it can recall facts about. Sparse autoencoders uncover meaningful directions in the representation space that detect whether the model recognizes an entity, e.g., detecting that it does not know about an athlete or a movie. This suggests that models can have self-knowledge: internal representations about their own capabilities. These directions are causally relevant: they can steer the model to refuse to answer questions about known entities, or to hallucinate attributes of unknown entities when it would otherwise refuse. We demonstrate that although the sparse autoencoders are trained on the base model, these directions have a causal effect on the chat model's refusal behavior, suggesting that chat finetuning has repurposed this existing mechanism. Furthermore, we provide an initial exploration of the mechanistic role of these directions, finding that they disrupt the attention of downstream heads that typically move entity attributes to the final token.
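To make the steering intervention concrete, the sketch below shows one common way such a direction-based intervention can be implemented: adding a scaled, unit-normalized direction to a layer's output via a PyTorch forward hook. This is a minimal illustration, not the paper's code; the dimensionality, the layer (a toy linear block stands in for a transformer block's residual stream), the direction (random rather than an SAE decoder column), and the steering scale are all hypothetical.

```python
import torch
import torch.nn as nn

d_model = 16  # hypothetical residual-stream width for illustration


def make_steering_hook(direction: torch.Tensor, scale: float):
    """Return a forward hook that adds `scale * direction` to a module's output."""
    direction = direction / direction.norm()  # unit-normalize the direction

    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the module's output.
        return output + scale * direction.to(output.dtype)

    return hook


# Toy stand-in for one transformer block whose output is the residual stream.
block = nn.Linear(d_model, d_model)

# In practice this would be an SAE-derived "unknown entity" direction;
# here it is random purely for illustration.
unknown_entity_direction = torch.randn(d_model)

handle = block.register_forward_hook(
    make_steering_hook(unknown_entity_direction, scale=8.0)
)

x = torch.randn(1, 4, d_model)  # (batch, seq, d_model)
steered = block(x)              # activations shifted along the direction
handle.remove()                 # detach the hook when done
```

In the setting the abstract describes, scaling such a direction up or down at the relevant layer is what produces the reported behavioral changes, e.g., inducing refusals for known entities or attribute hallucinations for unknown ones.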