Knowledge Editing (KE) algorithms alter models' weights to perform targeted updates to incorrect, outdated, or otherwise unwanted factual associations. In efforts to identify the possibilities and limitations of these approaches, recent work has shown that applying KE can adversely affect models' factual recall accuracy and diminish their general reasoning abilities. While these studies give broad insights into the potential harms of KE algorithms, e.g., via performance evaluations on benchmarks, we argue that little is understood about why such destructive failures occur. Is it possible that KE methods distort representations of concepts beyond the targeted fact, hence hampering abilities more broadly? If so, what is the extent of this distortion? Motivated by such questions, we define a novel synthetic task wherein a Transformer is trained from scratch to internalize a "structured" knowledge graph. The structure enforces relationships between entities of the graph, such that editing a factual association has "trickling effects" on other entities in the graph (e.g., editing X's parent from Y to Z affects who X's siblings' parent is). Through evaluations of edited models and analysis of extracted representations, we show that KE inadvertently affects representations of entities beyond the targeted one, distorting relevant structures that allow a model to infer unseen knowledge about an entity. We call this phenomenon representation shattering and demonstrate that it results in degradation of factual recall and reasoning performance more broadly. To corroborate our findings in a more naturalistic setup, we perform preliminary experiments with pre-trained Llama and Mamba models, reproducing the representation shattering effect in them as well. Overall, our work yields a precise mechanistic hypothesis to explain why KE has adverse effects on model abilities.
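To make the "trickling effects" concrete, the following is a minimal, hypothetical sketch (not the paper's actual task setup): a knowledge graph stores child-to-parent facts, siblinghood is derived rather than stored, and an edit to one parent fact either propagates to the siblings or leaves the stored facts mutually inconsistent. The entity names and the `parent_of` relation are illustrative assumptions.

```python
# Hypothetical sketch of structure-induced "trickling effects" in a knowledge
# graph edit. Entities X, A, B and the parent_of relation are illustrative.

# Ground-truth facts: child -> parent. X, A, B are siblings (shared parent Y).
parent_of = {"X": "Y", "A": "Y", "B": "Y"}

def siblings(child, facts):
    """Entities sharing `child`'s parent: derivable, but never stored directly."""
    p = facts.get(child)
    return [c for c, q in facts.items() if q == p and c != child]

def edit_fact(facts, child, new_parent, propagate=False):
    """Rewrite `child`'s parent. Respecting the graph's structure
    (propagate=True) also moves the siblings; otherwise the stored
    facts become mutually inconsistent."""
    sibs = siblings(child, facts)
    facts = dict(facts)
    facts[child] = new_parent
    if propagate:
        for s in sibs:
            facts[s] = new_parent
    return facts

naive = edit_fact(parent_of, "X", "Z")             # only the targeted fact changes
consistent = edit_fact(parent_of, "X", "Z", True)  # trickling effects applied

# After the naive edit, A and B still point at Y, so the derived answer to
# "who is X's siblings' parent?" contradicts the edited fact.
print(siblings("X", naive))       # [] -- X no longer shares a parent with A, B
print(siblings("X", consistent))  # ['A', 'B'] -- structure preserved
```

In this toy setting the inconsistency is explicit in the symbolic facts; the paper's claim is that an analogous distortion (representation shattering) arises implicitly in a model's learned representations when KE rewrites one fact without respecting the surrounding structure.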