Language models (LMs) may memorize personally identifiable information (PII) from their training data, enabling adversaries to extract it at inference time. Existing defense mechanisms such as differential privacy (DP) reduce this leakage, but incur large drops in utility. Based on a comprehensive study using circuit discovery to identify the computational circuits responsible for PII leakage in LMs, we hypothesize that specific PII leakage circuits underlie this behavior. We therefore propose PATCH (Privacy-Aware Targeted Circuit PatcHing), a novel approach that first identifies and then directly edits these circuits to reduce leakage. PATCH achieves a better privacy-utility trade-off than existing defenses, e.g., reducing the recall of PII leakage from LMs by up to 65%. Finally, PATCH can be combined with DP to reduce the recall of residual leakage of an LM to as low as 0.01%. Our analysis shows that PII leakage circuits persist even after existing defense mechanisms are applied, whereas PATCH can effectively mitigate their impact.
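To make the idea of circuit patching concrete, the sketch below (not the authors' implementation) shows one way to ablate attention heads hypothesized to form a PII-leakage circuit in a Hugging Face GPT-2 model. The head set `leakage_heads`, the prompt, and the zero-ablation strategy are illustrative assumptions standing in for the output of a circuit-discovery step and for PATCH's actual editing procedure.

```python
# Minimal sketch, assuming PII-leakage circuits can be approximated by a set
# of attention heads; zeroing their contributions is a stand-in for the
# targeted circuit edits described in the abstract.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

# Hypothetical output of a circuit-discovery step: (layer, head) pairs.
leakage_heads = {(5, 3), (9, 7)}

n_heads = model.config.n_head
head_dim = model.config.n_embd // n_heads

def make_pre_hook(layer_idx):
    # The input to attn.c_proj is the concatenation of per-head outputs
    # (batch, seq_len, n_embd), so zeroing a head's slice here ablates it.
    def hook(module, args):
        hidden = args[0]
        batch, seq_len, _ = hidden.shape
        hidden = hidden.view(batch, seq_len, n_heads, head_dim).clone()
        for layer, head in leakage_heads:
            if layer == layer_idx:
                hidden[:, :, head, :] = 0.0
        return (hidden.view(batch, seq_len, -1),)
    return hook

handles = [
    block.attn.c_proj.register_forward_pre_hook(make_pre_hook(i))
    for i, block in enumerate(model.transformer.h)
]

# Illustrative extraction-style prompt; with the hooks active, the patched
# heads no longer contribute to the continuation.
prompt = "The patient's phone number is"
ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(out[0]))

for h in handles:
    h.remove()
```

In this sketch the edit is reversible (hooks can be removed); a persistent variant would instead modify the affected weights directly, which is closer in spirit to patching the circuit itself.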