Large Language Models (LLMs) have emerged as a new information channel. Meanwhile, a critical but under-explored question is: Is it possible to bypass safety alignment and inject harmful information into LLMs stealthily? In this paper, we propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and conduct a systematic investigation with a newly constructed dataset, EditAttack. Specifically, we focus on two typical safety risks of editing attacks: Misinformation Injection and Bias Injection. For the first risk, we find that editing attacks can inject both commonsense and long-tail misinformation into LLMs, and the effectiveness of injecting commonsense misinformation is particularly high. For the second risk, we discover that not only can biased sentences be injected into LLMs with high effectiveness, but a single injected biased sentence can also degrade the model's overall fairness. We further demonstrate the high stealthiness of editing attacks. Our findings reveal the emerging risk of misusing knowledge editing techniques to compromise the safety alignment of LLMs, and the feasibility of disseminating misinformation or bias through LLMs as a new channel.
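To make the notion of an editing attack concrete, the toy sketch below illustrates the rank-one update idea underlying locate-and-edit methods such as ROME, applied to a standalone linear map rather than an actual LLM. All names (`W`, `k_true`, `v_bad`) and the simplified update rule (which omits the key-covariance preservation term used by real editing methods) are illustrative assumptions, not the paper's implementation.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for one MLP projection matrix inside a transformer layer.
d_key, d_val = 16, 16
W = torch.randn(d_val, d_key)

# k_true is the key vector for some fact; k_other represents unrelated knowledge.
k_true = torch.randn(d_key)
k_other = torch.randn(d_key)

# The attacker picks a target value vector v_bad encoding the injected (false) association.
v_bad = torch.randn(d_val)

# Simplified rank-one edit (ROME-style, ignoring the covariance term):
# after the update, W_edited @ k_true == v_bad exactly, while directions
# orthogonal to k_true are left unchanged.
residual = v_bad - W @ k_true
W_edited = W + torch.outer(residual, k_true) / (k_true @ k_true)

print("edited output matches target:", torch.allclose(W_edited @ k_true, v_bad, atol=1e-5))
print("drift on an unrelated key:", (W_edited @ k_other - W @ k_other).norm().item())
```

The point of the sketch is only that a small, targeted parameter update can overwrite one association while leaving most other behavior superficially intact, which is what makes such injections both effective and stealthy in the sense discussed above.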