Large language models pretrained on massive amounts of data capture rich knowledge and information from that data. Previous studies have shown that pretrained language models can memorize and regurgitate training data, which creates a risk of data leakage. To reduce this risk, we propose DEPN, a framework to Detect and Edit Privacy Neurons in pretrained language models, partially inspired by knowledge neurons and model editing. In DEPN, we introduce a novel method, termed the privacy neuron detector, to locate neurons associated with private information, and then edit these detected privacy neurons by setting their activations to zero. Furthermore, we propose a privacy neuron aggregator to dememorize private information in a batch processing manner. Experimental results show that our method can significantly and efficiently reduce the exposure of private data without deteriorating the performance of the model. Additionally, we empirically analyze the relationship between model memorization and privacy neurons from multiple perspectives, including model size, training time, prompts, and privacy neuron distribution, illustrating the robustness of our approach.
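To make the detect, aggregate, and edit steps concrete, the following is a minimal toy sketch, not the DEPN implementation. The abstract does not say how the detector scores neurons or how the aggregator combines results, so the sketch assumes that per-sequence attribution scores are already available, that the top-k neurons per private sequence are candidates, and that a neuron is erased when it is a candidate for at least a given share of sequences. All names (`ffn_forward`, `top_k_neurons`, `aggregate`, `min_share`) are hypothetical. Only the edit itself, setting the activations of selected neurons to zero, comes from the text.

```python
from collections import Counter


def ffn_forward(x, w_in, w_out, erased=frozenset()):
    """Toy feed-forward block: hidden = relu(x . w_in[j]), out = hidden . w_out.

    w_in holds one input-weight vector per hidden neuron; w_out holds one
    output-weight vector per hidden neuron. Hidden neurons whose index is in
    `erased` have their activation set to zero (the edit described above).
    """
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, weights))) for weights in w_in]
    hidden = [0.0 if j in erased else h for j, h in enumerate(hidden)]
    n_out = len(w_out[0])
    return [sum(h * w_out[j][k] for j, h in enumerate(hidden)) for k in range(n_out)]


def top_k_neurons(scores, k):
    """Indices of the k highest-scoring neurons for one private sequence.

    Assumption: `scores` are attribution scores from some detector; how they
    are computed is outside this sketch.
    """
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def aggregate(per_sequence_scores, k, min_share):
    """Assumed batch rule: keep neurons that are top-k for >= min_share of sequences."""
    counts = Counter()
    for scores in per_sequence_scores:
        counts.update(top_k_neurons(scores, k))
    threshold = min_share * len(per_sequence_scores)
    return frozenset(j for j, c in counts.items() if c >= threshold)


# Toy data: 2 inputs, 3 hidden neurons, 1 output.
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_out = [[1.0], [2.0], [3.0]]
x = [1.0, 2.0]

# Made-up attribution scores for three private sequences.
scores = [[0.1, 0.2, 0.9], [0.3, 0.1, 0.8], [0.7, 0.0, 0.6]]

privacy_neurons = aggregate(scores, k=1, min_share=0.5)
before = ffn_forward(x, w_in, w_out)
after = ffn_forward(x, w_in, w_out, erased=privacy_neurons)
```

In this toy setup, neuron 2 is the top-scoring neuron for two of the three sequences, so it is the only one erased, and the output changes because that neuron's contribution is removed. In a real model the same masking would be applied to the feed-forward activations of the chosen layers, and the scores would come from an actual attribution method.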