We reveal new methods, and the theoretical foundations, of techniques for editing large language models. We also show how the new theory can be used to assess the editability of models and to expose their susceptibility to previously unknown malicious attacks. Our theoretical approach shows that a single metric (a specific measure of the intrinsic dimensionality of the model's features) is fundamental to predicting the success of popular editing approaches, and reveals new bridges between disparate families of editing methods. We collectively refer to these approaches as stealth editing methods, because they aim to directly and inexpensively update a model's weights to correct its responses to known hallucinating prompts, without retraining and without otherwise affecting its behaviour. By carefully applying the insight gleaned from our theoretical investigation, we introduce a new network block -- named a jet-pack block -- which is optimised for highly selective model editing, uses only standard network operations, and can be inserted into existing networks. The intrinsic dimensionality metric also determines a language model's vulnerability to a stealth attack: a small change to a model's weights which alters its response to a single attacker-chosen prompt. Stealth attacks do not require access to or knowledge of the model's training data, and therefore represent a potent yet previously unrecognised threat to redistributed foundation models. In many cases, they are computationally simple enough to be implemented in malware. Extensive experimental results illustrate and support the method and its theoretical underpinnings. Demos and source code for editing language models are available at https://github.com/qinghua-zhou/stealth-edits.