This paper investigates Who's Harry Potter (WHP), a pioneering yet insufficiently understood method for LLM unlearning. We study it in two steps. First, we introduce a new task of LLM targeted unlearning, where given an unlearning target (e.g., a person) and some unlearning documents, we aim to unlearn only the information about the target, rather than everything in the unlearning documents. We further argue that successful unlearning should satisfy criteria such as not outputting gibberish, not fabricating facts about the unlearning target, and not releasing factual information under jailbreak attacks. Second, we construct a causal intervention framework for targeted unlearning, in which the knowledge of the unlearning target is modeled as a confounder between LLM input and output, and the unlearning process is modeled as a deconfounding process. This framework justifies and extends WHP, deriving a simple unlearning algorithm that includes WHP as a special case. Experiments on existing and new datasets show that our approach, without explicitly optimizing for the aforementioned criteria, achieves competitive performance on all of them. Our code is available at https://github.com/UCSB-NLP-Chang/causal_unlearn.git.