Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what an LLM should not know is important for ensuring alignment and, in turn, safe use. However, accurately and efficiently unlearning knowledge from an LLM remains challenging, owing to the potential collateral damage caused by the fuzzy boundary between retention and forgetting, and to the large computational cost of optimizing state-of-the-art models with hundreds of billions of parameters. In this work, we present \textbf{Embedding-COrrupted (ECO) Prompts}, a lightweight unlearning framework for LLMs that addresses both challenges: knowledge entanglement and unlearning efficiency. Instead of relying on the LLM itself to unlearn, we enforce an unlearned state during inference by employing a prompt classifier to identify and safeguard prompts that should be forgotten. We learn corruptions added to prompt embeddings offline via zeroth-order optimization toward the unlearning objective, and corrupt prompts flagged by the classifier during inference. We find that these embedding-corrupted prompts not only lead to desirable outputs that satisfy the unlearning objective but also closely approximate the output of a model that has never been trained on the data intended for forgetting. Through extensive unlearning experiments, we demonstrate that our method achieves promising unlearning with \textit{nearly zero side effects} on general domains and on domains closely related to the unlearned ones. Additionally, we show that our method scales to 100 LLMs, ranging from 0.5B to 236B parameters, incurring no additional cost as the number of parameters increases. Our code is publicly available at \url{https://github.com/chrisliu298/llm-unlearn-eco}.
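To make the core mechanism concrete, the following is a minimal, self-contained sketch of learning a corruption vector for a prompt embedding via a two-point zeroth-order gradient estimate. All names here (`unlearning_loss`, `zo_gradient`, the toy quadratic objective, and the hyperparameters) are illustrative assumptions, not the paper's actual objective or implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def unlearning_loss(corrupted_emb):
    # Hypothetical stand-in for a real unlearning objective: pull the
    # corrupted embedding toward a fixed "unlearned" target (here, zero).
    target = np.zeros_like(corrupted_emb)
    return float(np.mean((corrupted_emb - target) ** 2))

def zo_gradient(delta, emb, mu=1e-2):
    # Two-point zeroth-order estimate: perturb the corruption delta along a
    # random direction u and difference the loss, requiring only forward
    # evaluations of the objective (no backpropagation through the model).
    u = rng.standard_normal(delta.shape)
    loss_plus = unlearning_loss(emb + delta + mu * u)
    loss_minus = unlearning_loss(emb + delta - mu * u)
    return (loss_plus - loss_minus) / (2 * mu) * u

# Learn a single corruption vector offline for one toy prompt embedding.
emb = rng.standard_normal(16)   # toy prompt embedding (dimension 16)
delta = np.zeros_like(emb)      # corruption, updated by zeroth-order SGD
lr = 0.1
for _ in range(200):
    delta -= lr * zo_gradient(delta, emb)

# The corrupted embedding scores lower on the unlearning objective.
assert unlearning_loss(emb + delta) < unlearning_loss(emb)
```

At inference time, a deployment following this scheme would add the learned `delta` only to embeddings of prompts the classifier flags as in-scope for forgetting, leaving all other prompts untouched.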