Large language models pretrained on massive corpora capture rich knowledge from their training data. Previous studies have shown that pretrained language models can memorize and regurgitate training data, which creates a risk of data leakage. To reduce this risk, we propose DEPN, a framework to Detect and Edit Privacy Neurons in pretrained language models, partially inspired by work on knowledge neurons and model editing. In DEPN, we introduce a novel method, termed the privacy neuron detector, to locate neurons associated with private information, and then edit the detected privacy neurons by setting their activations to zero. Furthermore, we propose a privacy neuron aggregator to dememorize private information in a batch processing manner. Experimental results show that our method significantly and efficiently reduces the exposure of private data without degrading model performance. Additionally, we empirically demonstrate the relationship between model memorization and privacy neurons from multiple perspectives, including model size, training time, prompts, and privacy neuron distribution, illustrating the robustness of our approach.
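The detect, aggregate, and edit steps described in the abstract can be illustrated with a minimal sketch on a toy feed-forward block. The abstract does not specify how the privacy neuron detector scores neurons, so this sketch substitutes a simple ablation score (the drop in the private token's probability when one neuron is zeroed). The frequency-based aggregation rule is likewise an assumption. All names (`forward`, `ablation_scores`, `aggregate`, `top_k`, `min_count`) and the toy weights are hypothetical and are not taken from DEPN.

```python
import math
from collections import Counter

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def forward(W_in, W_out, x, zeroed=frozenset()):
    """Toy feed-forward block. Neurons listed in `zeroed` are edited
    by setting their activation to zero before the output projection."""
    h = relu(matvec(W_in, x))
    h = [0.0 if i in zeroed else a for i, a in enumerate(h)]
    return softmax(matvec(W_out, h))

def ablation_scores(W_in, W_out, x, target):
    """Assumed stand-in detector: score each neuron by how much
    zeroing it alone lowers the probability of the private token."""
    base = forward(W_in, W_out, x)[target]
    return [base - forward(W_in, W_out, x, frozenset({i}))[target]
            for i in range(len(W_in))]

def aggregate(examples, W_in, W_out, top_k=1, min_count=2):
    """Assumed aggregator: keep neurons that rank in the top_k
    for at least min_count private examples in the batch."""
    counts = Counter()
    for x, target in examples:
        scores = ablation_scores(W_in, W_out, x, target)
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        counts.update(ranked[:top_k])
    return {i for i, c in counts.items() if c >= min_count}

# Hypothetical toy weights: 2 inputs, 3 hidden neurons, 2 output tokens.
# Output token 1 plays the role of the private token.
W_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_out = [[0.0, 1.0, 0.5], [3.0, 0.0, 0.5]]
examples = [([1.0, 0.2], 1), ([0.8, 0.1], 1)]

privacy = aggregate(examples, W_in, W_out)
before = [forward(W_in, W_out, x)[t] for x, t in examples]
after = [forward(W_in, W_out, x, frozenset(privacy))[t] for x, t in examples]
print("privacy neurons:", sorted(privacy))
print("P(private token) before edit:", before)
print("P(private token) after edit:", after)
```

In a real pretrained model, the same editing step would apply to the intermediate activations of selected feed-forward layers, and the scoring function would be replaced by whatever attribution method the detector actually uses.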