Concerns about the tendency of Large Language Models (LLMs) to memorize and disclose private information, particularly Personally Identifiable Information (PII), have become prominent within the community. Many efforts have been made to mitigate these privacy risks. However, the mechanism through which LLMs memorize PII remains poorly understood. To bridge this gap, we introduce a pioneering method for pinpointing PII-sensitive neurons (privacy neurons) within LLMs. Our method employs learnable binary weight masks to localize, through adversarial training, the specific neurons that account for the memorization of PII in LLMs. Our investigation reveals that PII is memorized by a small subset of neurons distributed across all layers, and that these neurons exhibit the property of PII specificity. Furthermore, we validate the method's potential for PII risk mitigation by deactivating the localized privacy neurons. Both quantitative and qualitative experiments demonstrate the effectiveness of our neuron localization algorithm.
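To make the localization mechanism concrete, the following is a minimal PyTorch sketch of a learnable (relaxed) binary mask over one layer's neurons; it is an illustrative reading of the abstract, not the authors' exact formulation. The class name `NeuronMask`, the sigmoid relaxation, the initialization constant, and the tensor sizes in the usage lines are all assumptions introduced here for illustration.

```python
import torch
import torch.nn as nn

class NeuronMask(nn.Module):
    """Learnable relaxed-binary mask over the neurons of one hidden layer.

    Each neuron gets a real-valued logit; a sigmoid squashes it to a soft
    gate in (0, 1) during training, and the gates are thresholded to {0, 1}
    afterwards to read off which neurons were localized.
    """

    def __init__(self, num_neurons: int, init_logit: float = 3.0):
        super().__init__()
        # Start with high logits so every gate opens near 1 ("all neurons on").
        self.logits = nn.Parameter(torch.full((num_neurons,), init_logit))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Gate every neuron's activation; broadcasting covers batch/seq dims.
        return hidden * torch.sigmoid(self.logits)

    def sparsity_penalty(self) -> torch.Tensor:
        # Penalize closed gates so only a small subset of neurons is masked.
        return (1.0 - torch.sigmoid(self.logits)).mean()

    def hard_mask(self, threshold: float = 0.5) -> torch.Tensor:
        # Binarize: 0 marks a localized privacy neuron to deactivate.
        return (torch.sigmoid(self.logits) > threshold).float()

# Example: gate a batch of hidden activations (hypothetical sizes).
mask = NeuronMask(num_neurons=3072)
hidden = torch.randn(2, 16, 3072)      # (batch, sequence, neurons)
gated = mask(hidden)                   # soft-masked activations
kept = int(mask.hard_mask().sum())     # neurons surviving binarization
```

In a real model, a mask of this form would be interposed on the layer's activations (e.g., via a forward hook on each MLP block), so that training the logits selects neurons without modifying the LLM's weights.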
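The adversarial training of the mask can likewise be sketched as follows. This is one plausible instantiation of the objective described above, assuming a Hugging Face-style causal LM whose forward pass returns a `.loss` and into which the mask has been wired; the loss weightings and batch names are hypothetical hyperparameters, not values from the paper.

```python
def mask_training_step(model, mask, optimizer, pii_batch, general_batch,
                       lambda_util: float = 1.0, lambda_sparse: float = 0.5):
    """One illustrative optimization step for the mask logits.

    Only mask.logits is trainable; the LLM weights stay frozen. The mask is
    pushed to RAISE the loss on memorized PII sequences (disrupting the
    memorization) while KEEPING the loss on general text low; the sparsity
    term keeps the selected neuron subset small.
    """
    optimizer.zero_grad()
    pii_loss = model(**pii_batch).loss          # want HIGH under the mask
    general_loss = model(**general_batch).loss  # want LOW under the mask
    loss = (-pii_loss
            + lambda_util * general_loss
            + lambda_sparse * mask.sparsity_penalty())
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under this reading, the mitigation step in the abstract amounts to applying the binarized mask at inference time: zeroing the activations of the localized privacy neurons so the memorized PII can no longer be reproduced.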