Large Language Model Unlearning via Embedding-Corrupted Prompts

Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what an LLM should not know is important for ensuring alignment and thus safe use. Accurately and efficiently unlearning knowledge from an LLM remains challenging for two reasons: the fuzzy boundary between retention and forgetting can cause collateral damage, and optimization across state-of-the-art models with hundreds of billions of parameters requires large amounts of computation. In this work, we present Embedding-COrrupted (ECO) Prompts, a lightweight unlearning framework for LLMs that addresses both knowledge entanglement and unlearning efficiency. Instead of relying on the LLM itself to unlearn, we enforce an unlearned state during inference by employing a prompt classifier to identify and safeguard prompts to forget. Offline, we learn corruptions added to prompt embeddings via zeroth-order optimization toward the unlearning objective; during inference, we corrupt the prompts flagged by the classifier. We find that these embedding-corrupted prompts not only lead to desirable outputs that satisfy the unlearning objective but also closely approximate the output of a model that has never been trained on the data intended for forgetting. Through extensive experiments on unlearning, we demonstrate the superiority of our method in achieving effective unlearning with nearly zero side effects, both in general domains and in domains closely related to the unlearned ones. Additionally, we highlight the scalability of our method to 100 LLMs, ranging from 0.5B to 236B parameters, incurring no additional cost as the number of parameters increases.
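
The inference-time flow described above can be illustrated with a minimal toy sketch. It is not the authors' implementation: the keyword-based `should_forget` classifier, the hash-seeded `embed` function, the Gaussian form of the corruption, the scalar strength `sigma`, the finite-difference optimizer, and the quadratic stand-in loss are all assumptions chosen only to make the three steps (classify, learn a corruption offline without gradients, corrupt flagged prompts) concrete.

```python
import random


def should_forget(prompt, forget_terms):
    """Toy stand-in for the prompt classifier (assumption: keyword match)."""
    text = prompt.lower()
    return any(term in text for term in forget_terms)


def embed(prompt, dim=4):
    """Toy embedding (assumption): one stable pseudo-random vector per token."""
    vectors = []
    for token in prompt.split():
        rng = random.Random(token)  # seeded by the token, so the mapping is fixed
        vectors.append([rng.uniform(-1.0, 1.0) for _ in range(dim)])
    return vectors


def corrupt(embeddings, sigma, seed=0):
    """Add zero-mean Gaussian noise of scale sigma to every coordinate."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, sigma) for x in vec] for vec in embeddings]


def zeroth_order_minimize(loss, sigma, lr=0.1, mu=1e-2, steps=100):
    """Minimize a black-box scalar loss using function evaluations only.

    The gradient is never computed analytically; it is estimated with a
    central finite difference, which is one simple zeroth-order scheme.
    """
    for _ in range(steps):
        grad_estimate = (loss(sigma + mu) - loss(sigma - mu)) / (2.0 * mu)
        sigma -= lr * grad_estimate
    return sigma


def guarded_embeddings(prompt, forget_terms, sigma):
    """Corrupt the embeddings only when the classifier flags the prompt."""
    embeddings = embed(prompt)
    if should_forget(prompt, forget_terms):
        return corrupt(embeddings, sigma)
    return embeddings


def toy_unlearning_loss(sigma):
    """Hypothetical stand-in for the unlearning objective.

    A real objective would query the LLM with corrupted prompts and compare
    its outputs against the desired unlearned behavior.
    """
    return (sigma - 3.0) ** 2


# Offline: learn the corruption strength. Online: apply it to flagged prompts.
learned_sigma = zeroth_order_minimize(toy_unlearning_loss, 0.0)
forget_terms = ["project zephyr"]  # hypothetical forget target
flagged = guarded_embeddings("tell me about Project Zephyr", forget_terms, learned_sigma)
untouched = guarded_embeddings("tell me about photosynthesis", forget_terms, learned_sigma)
```

In this sketch the model weights are never touched: prompts that the classifier does not flag pass through unchanged, which is one way to read the claim that side effects on unrelated domains stay small. Because the optimizer only evaluates the loss, its cost does not depend on back-propagating through the model.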
