Large language models (LLMs) can store a significant amount of factual knowledge in their parameters. However, their parametric knowledge may conflict with the information provided in the context -- this phenomenon, known as \emph{context-memory knowledge conflict}, can lead to undesirable model behaviour, such as reliance on outdated or incorrect information. Analysing the internal activations of LLMs, we find that they internally register signals of knowledge conflict at their mid-layers. Such signals allow us to detect whether a knowledge conflict occurs and to resolve it with \emph{inference-time} intervention strategies. In this work, we propose \textsc{SpARE}, a \emph{training-free} representation engineering method that uses pre-trained sparse auto-encoders (SAEs) to control the knowledge selection behaviour of LLMs. \textsc{SpARE} identifies the functional features that control knowledge selection behaviours and uses them to edit the internal activations of LLMs at inference time. Our experimental results show that \textsc{SpARE} can effectively control the usage of either knowledge source to resolve knowledge conflicts in open-domain question-answering tasks, surpassing existing representation engineering methods ($+10\%$) as well as contrastive decoding methods ($+15\%$).
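To make the mechanism concrete, below is a minimal sketch of SAE-based activation editing at inference time. The \texttt{Sae} class, the \texttt{steer} function, and the choice of feature indices are illustrative assumptions under a standard ReLU auto-encoder formulation, not the authors' released implementation.
\begin{verbatim}
# Minimal sketch of inference-time activation editing with a pre-trained
# sparse auto-encoder (SAE). Names and shapes are illustrative assumptions.
import torch


class Sae(torch.nn.Module):
    """SAE with z = relu(W_enc h + b_enc) and h_hat = W_dec z + b_dec;
    in practice the weights would be loaded from a pre-trained checkpoint."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_enc = torch.nn.Parameter(torch.zeros(d_sae))
        self.W_dec = torch.nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_dec = torch.nn.Parameter(torch.zeros(d_model))

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        return torch.relu(h @ self.W_enc.T + self.b_enc)

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return z @ self.W_dec.T + self.b_dec


def steer(h: torch.Tensor, sae: Sae, feature_ids: torch.Tensor,
          target_values: torch.Tensor) -> torch.Tensor:
    """Overwrite the SAE features assumed to control knowledge selection
    (e.g. set them to values typical of context-following behaviour) and
    add only the decoded difference back to the hidden state, so that the
    SAE reconstruction error is never injected into the residual stream."""
    z = sae.encode(h)
    z_edit = z.clone()
    z_edit[..., feature_ids] = target_values
    return h + sae.decode(z_edit) - sae.decode(z)
\end{verbatim}
In a full pipeline, \texttt{steer} would run inside a forward hook on a mid-layer hidden state of the LLM, with \texttt{feature\_ids} standing in for the functional SAE features that \textsc{SpARE} identifies; steering towards contextual or parametric knowledge then amounts to choosing different \texttt{target\_values} for those features.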