The underlying mechanisms of memorization in LLMs (the verbatim reproduction of training data) remain poorly understood. What exact part of the network decides to retrieve a token that we would consider the start of a memorized sequence? How exactly does the model's behaviour differ when producing memorized versus non-memorized text? In this work we approach these questions from a mechanistic interpretability standpoint by utilizing transformer circuits: the minimal computational subgraphs that perform specific functions within the model. Through carefully constructed contrastive datasets, we identify the points where model generation diverges from memorized content and isolate the specific circuits responsible for two distinct aspects of memorization. We find that circuits that initiate memorization can also maintain it once started, whereas circuits that only maintain memorization cannot trigger its initiation. Intriguingly, memorization-prevention mechanisms transfer robustly across text domains, while memorization induction appears more context-dependent.
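To make the divergence-point idea concrete, the sketch below locates the first token at which a model's greedy continuation departs from a memorized reference string. This is a minimal sketch under stated assumptions, not the paper's pipeline: the model name (gpt2) and the helper divergence_point are illustrative, and any HuggingFace-style causal LM could stand in.

```python
# Minimal sketch (illustrative, not the paper's code): find the token position
# where greedy prediction first diverges from a memorized continuation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any HuggingFace causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def divergence_point(prompt: str, memorized: str) -> int:
    """Return the index of the first memorized token the model's greedy
    prediction fails to reproduce (-1 if it reproduces all of them)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # Caveat: tokenizing the continuation separately can split differently
    # at the boundary than tokenizing prompt + continuation together.
    target_ids = tokenizer(memorized, return_tensors="pt").input_ids
    full = torch.cat([prompt_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(full).logits
    n_prompt = prompt_ids.shape[1]
    # Logits at position t predict the token at position t + 1, so the
    # predictions for the memorized span start at index n_prompt - 1.
    preds = logits[0, n_prompt - 1 : full.shape[1] - 1].argmax(dim=-1)
    for i, (pred, tgt) in enumerate(zip(preds, target_ids[0])):
        if pred != tgt:
            return i  # first position where generation leaves the memorized text
    return -1
```

Contrastive pairs can then be built around this position, comparing model internals on prompts that do and do not trigger the memorized continuation.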