Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what an LLM should not know is important for ensuring alignment and, in turn, safe use. However, accurately and efficiently unlearning knowledge from an LLM remains challenging, owing to the potential collateral damage caused by the fuzzy boundary between retention and forgetting, and to the large computational cost of optimizing state-of-the-art models with hundreds of billions of parameters. In this work, we present \textbf{Embedding-COrrupted (ECO) Prompts}, a lightweight unlearning framework for LLMs that addresses both challenges: knowledge entanglement and unlearning efficiency. Instead of relying on the LLM itself to unlearn, we enforce an unlearned state during inference by employing a prompt classifier to identify and safeguard prompts that should be forgotten. We learn corruptions added to prompt embeddings offline via zeroth-order optimization toward the unlearning objective, and corrupt prompts flagged by the classifier during inference. We find that these embedding-corrupted prompts not only lead to desirable outputs that satisfy the unlearning objective but also closely approximate the output of a model that has never been trained on the data intended for forgetting. Through extensive unlearning experiments, we demonstrate that our method achieves promising unlearning with \textit{nearly zero side effects} on general domains and on domains closely related to the unlearned ones. Additionally, we show that our method scales to 100 LLMs, ranging from 0.5B to 236B parameters, incurring no additional cost as the number of parameters increases. Our code is publicly available at \url{https://github.com/chrisliu298/llm-unlearn-eco}.
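To make the core mechanism concrete, the following is a minimal, self-contained sketch of learning a corruption vector for a prompt embedding via a two-point zeroth-order gradient estimate. All names here (`unlearning_loss`, `zo_gradient`, the toy quadratic objective, and the hyperparameters) are illustrative assumptions, not the paper's actual objective or implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def unlearning_loss(corrupted_emb):
    # Hypothetical stand-in for a real unlearning objective: pull the
    # corrupted embedding toward a fixed "unlearned" target (here, zero).
    target = np.zeros_like(corrupted_emb)
    return float(np.mean((corrupted_emb - target) ** 2))

def zo_gradient(delta, emb, mu=1e-2):
    # Two-point zeroth-order estimate: perturb the corruption delta along a
    # random direction u and difference the loss, requiring only forward
    # evaluations of the objective (no backpropagation through the model).
    u = rng.standard_normal(delta.shape)
    loss_plus = unlearning_loss(emb + delta + mu * u)
    loss_minus = unlearning_loss(emb + delta - mu * u)
    return (loss_plus - loss_minus) / (2 * mu) * u

# Learn a single corruption vector offline for one toy prompt embedding.
emb = rng.standard_normal(16)   # toy prompt embedding (dimension 16)
delta = np.zeros_like(emb)      # corruption, updated by zeroth-order SGD
lr = 0.1
for _ in range(200):
    delta -= lr * zo_gradient(delta, emb)

# The corrupted embedding scores lower on the unlearning objective.
assert unlearning_loss(emb + delta) < unlearning_loss(emb)
```

At inference time, a deployment following this scheme would add the learned `delta` only to embeddings of prompts the classifier flags as in-scope for forgetting, leaving all other prompts untouched.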