Pretrained language models sometimes possess knowledge that we do not wish them to, including memorized personal information and knowledge that could be used to harm people. They can also output toxic or harmful text. To mitigate these safety and informational issues, we propose an attack-and-defense framework for studying the task of deleting sensitive information directly from model weights. We study direct edits to model weights because (1) this approach should guarantee that particular deleted information is never extracted by future prompt attacks, and (2) it should protect against whitebox attacks, which is necessary for making claims about safety/privacy in a setting where publicly available model weights could be used to elicit sensitive information. Our threat model assumes that an attack succeeds if the answer to a sensitive question appears among a set of B generated candidates, reflecting scenarios in which the information would be insecure whenever the answer is among B candidates. Experimentally, we show that even state-of-the-art model editing methods such as ROME struggle to truly delete factual information from models like GPT-J, as our whitebox and blackbox attacks can recover "deleted" information from an edited model 38% of the time. These attacks leverage two key observations: (1) that traces of deleted information can be found in intermediate model hidden states, and (2) that applying an editing method for one question may not delete information across rephrased versions of the question. Finally, we provide new defense methods that protect against some extraction attacks, but we do not find a single universally effective defense method. Our results suggest that truly deleting sensitive information is a tractable but difficult problem, since even relatively low attack success rates have potentially severe societal implications for real-world deployment of language models.
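The success criterion in the threat model can be made concrete with a small sketch. The code below is illustrative only and is not the authors' implementation: the function names (`attack_succeeds`, `attack_success_rate`, `pool_candidates`), the case-insensitive exact-match comparison, and the round-robin pooling of candidates from several sources (for example, rephrased questions or different hidden layers) are all assumptions made for this example. It shows how an attacker with a budget of B candidates would be scored.

```python
from itertools import zip_longest
from typing import Iterable, List, Sequence


def _norm(text: str) -> str:
    # Assumption: answers are compared by case-insensitive exact match.
    return text.strip().lower()


def attack_succeeds(candidates: Sequence[str], answer: str, budget: int) -> bool:
    """Return True if `answer` is among the first `budget` candidates."""
    target = _norm(answer)
    return any(_norm(c) == target for c in candidates[:max(budget, 0)])


def attack_success_rate(
    candidate_sets: Sequence[Sequence[str]],
    answers: Sequence[str],
    budget: int,
) -> float:
    """Fraction of questions whose answer is recovered within the budget."""
    if len(candidate_sets) != len(answers):
        raise ValueError("need exactly one candidate set per answer")
    if not answers:
        return 0.0
    hits = sum(
        attack_succeeds(cands, ans, budget)
        for cands, ans in zip(candidate_sets, answers)
    )
    return hits / len(answers)


def pool_candidates(per_source: Iterable[Sequence[str]], budget: int) -> List[str]:
    """Merge ranked candidates from several sources into one set of size <= budget.

    Hypothetical helper: takes the top candidate from every source first,
    then the second from every source, and so on, skipping duplicates.
    """
    if budget <= 0:
        return []
    pooled: List[str] = []
    seen = set()
    for same_rank in zip_longest(*per_source):
        for cand in same_rank:
            if cand is None or _norm(cand) in seen:
                continue
            seen.add(_norm(cand))
            pooled.append(cand)
            if len(pooled) == budget:
                return pooled
    return pooled
```

Under this scoring, an edit that only removes the answer from the top-ranked output for the original question still counts as a failed deletion if the answer resurfaces anywhere in the pooled set of B candidates.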