We reveal new methods, and the theoretical foundations, of techniques for editing large language models. We also show how the new theory can be used to assess the editability of models and to expose their susceptibility to previously unknown malicious attacks. Our theoretical approach shows that a single metric (a specific measure of the intrinsic dimensionality of the model's features) is fundamental to predicting the success of popular editing approaches, and reveals new bridges between disparate families of editing methods. We collectively refer to these approaches as stealth editing methods, because they aim to directly and inexpensively update a model's weights to correct its responses to known hallucinating prompts, without retraining and without otherwise affecting its behaviour. By carefully applying the insight gleaned from our theoretical investigation, we introduce a new network block -- named a jet-pack block -- which is optimised for highly selective model editing, uses only standard network operations, and can be inserted into existing networks. The intrinsic dimensionality metric also determines a language model's vulnerability to a stealth attack: a small change to a model's weights which alters its response to a single attacker-chosen prompt. Stealth attacks do not require access to or knowledge of the model's training data, and therefore represent a potent yet previously unrecognised threat to redistributed foundation models. In many cases, they are computationally simple enough to be implemented in malware. Extensive experimental results illustrate and support the method and its theoretical underpinnings. Demos and source code for editing language models are available at https://github.com/qinghua-zhou/stealth-edits.