Large language models (LLMs) can memorize many sequences from their pretraining data verbatim. This paper studies whether we can locate a small set of neurons in an LLM that is responsible for memorizing a given sequence. While the concept of localization is often mentioned in prior work, localization methods have never been systematically and directly evaluated; we address this gap with two benchmarking approaches. In our INJ Benchmark, we actively inject a piece of new information into a small subset of LLM weights and measure whether localization methods can identify these "ground truth" weights. In the DEL Benchmark, we study localization of pretrained data that LLMs have already memorized; while this setting lacks ground truth, we can still evaluate localization by measuring whether dropping out the located neurons erases a memorized sequence from the model. We evaluate five localization methods on both benchmarks, and the two benchmarks rank the methods similarly. All methods exhibit promising localization ability, especially the pruning-based ones, though the neurons they identify are not necessarily specific to a single memorized sequence.
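The DEL-style evaluation described above (drop out the located neurons, then check whether the memorized sequence is erased while other sequences survive) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the four-neuron hidden layer, the hand-built weights, and the helper names `attribution_scores`, `top_k_neurons`, `dropout_neurons`, and `sequence_accuracy` are hypothetical, and the activation-times-weight score is a simple stand-in, not one of the five localization methods evaluated in the paper.

```python
import numpy as np

def attribution_scores(hidden, W_out, targets):
    """Stand-in localization score (an assumption, not the paper's method):
    each neuron's summed contribution to the target-token logits."""
    # W_out[:, targets] has shape (neurons, positions); transpose to match hidden.
    contrib = hidden * W_out[:, targets].T
    return contrib.sum(axis=0)

def top_k_neurons(scores, k):
    """Indices of the k highest-scoring neurons."""
    return np.argsort(scores)[::-1][:k]

def dropout_neurons(hidden, neuron_ids):
    """Return a copy of the activations with the chosen neurons zeroed."""
    out = hidden.copy()
    out[..., np.asarray(neuron_ids)] = 0.0
    return out

def sequence_accuracy(hidden, W_out, targets):
    """Fraction of positions where greedy decoding recovers the target token."""
    logits = hidden @ W_out
    return float(np.mean(np.argmax(logits, axis=-1) == targets))

# Toy output layer: 4 hidden neurons, vocabulary of 3 tokens (hand-built).
W_out = np.array([
    [3.0, 0.0, 0.0],   # neuron 0 pushes token 0
    [0.0, 3.0, 0.0],   # neuron 1 pushes token 1
    [0.0, 0.0, 1.0],   # neurons 2 and 3 weakly push token 2
    [0.0, 0.0, 1.0],
])

# Hidden activations (positions x neurons) for a "memorized" and a control sequence.
h_mem = np.array([[2.0, 0.0, 0.5, 0.0], [2.0, 0.0, 0.0, 0.5]])
t_mem = np.array([0, 0])
h_ctl = np.array([[0.0, 2.0, 0.5, 0.0], [0.0, 2.0, 0.0, 0.5]])
t_ctl = np.array([1, 1])

# Locate one neuron for the memorized sequence, then drop it out everywhere.
located = top_k_neurons(attribution_scores(h_mem, W_out, t_mem), k=1)

mem_before = sequence_accuracy(h_mem, W_out, t_mem)
mem_after = sequence_accuracy(dropout_neurons(h_mem, located), W_out, t_mem)
ctl_after = sequence_accuracy(dropout_neurons(h_ctl, located), W_out, t_ctl)

print("located neurons:", located.tolist())
print("memorized sequence accuracy:", mem_before, "->", mem_after)
print("control sequence accuracy after dropout:", ctl_after)
```

In this constructed case the located neuron is specific to the memorized sequence, so the control sequence is unaffected. The abstract notes that in real LLMs the identified neurons are not necessarily this specific, which is why checking a control sequence alongside the target is part of the sketch.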