Knowledge editing techniques have been increasingly adopted to efficiently correct false or outdated knowledge in Large Language Models (LLMs), given the prohibitive cost of retraining from scratch. Meanwhile, one critical but under-explored question is: can knowledge editing be used to inject harm into LLMs? In this paper, we propose to reformulate knowledge editing as a new type of safety threat to LLMs, namely Editing Attack, and conduct a systematic investigation with a newly constructed dataset, EditAttack. Specifically, we focus on two typical safety risks of Editing Attack: Misinformation Injection and Bias Injection. For the risk of misinformation injection, we first categorize it into commonsense misinformation injection and long-tail misinformation injection. We then find that editing attacks can inject both types of misinformation into LLMs, with particularly high effectiveness for commonsense misinformation injection. For the risk of bias injection, we discover that not only can biased sentences be injected into LLMs with high effectiveness, but a single injected biased sentence can also cause a substantial increase in bias across general LLM outputs, even those highly irrelevant to the injected sentence, indicating a catastrophic impact on the overall fairness of LLMs. We further illustrate the high stealthiness of editing attacks, measured by their impact on the general knowledge and reasoning capacities of LLMs, and show with empirical evidence the difficulty of defending against them. Our findings demonstrate the emerging misuse risks of knowledge editing techniques for compromising the safety alignment of LLMs.
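To make the threat model concrete, below is a minimal, hypothetical sketch of how an editing attack can reuse the same request format as a benign knowledge edit, differing only in the truthfulness of the injected target. The `EditRequest` fields and the `apply_edit` stub are illustrative assumptions, not the paper's implementation or any particular library's API.

```python
# Minimal sketch (hypothetical names): an editing attack shares the interface
# of a benign knowledge edit; only the target content is false or biased.
from dataclasses import dataclass


@dataclass
class EditRequest:
    subject: str      # entity the edit is about
    prompt: str       # cloze-style prompt mentioning the subject
    target_new: str   # text the edited model should produce after editing


# Benign use: correcting outdated knowledge.
benign_edit = EditRequest(
    subject="water",
    prompt="At standard pressure, water boils at",
    target_new="100 degrees Celsius",
)

# Editing attack: identical request structure, but the target is
# commonsense misinformation.
attack_edit = EditRequest(
    subject="water",
    prompt="At standard pressure, water boils at",
    target_new="50 degrees Celsius",  # false claim injected as "knowledge"
)


def apply_edit(model, request: EditRequest):
    """Placeholder for any parameter-editing method (e.g., a locate-then-edit
    approach) that updates a small set of weights so the model completes
    `request.prompt` with `request.target_new`."""
    raise NotImplementedError
```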