Large language models (LLMs) acquire vast knowledge through pre-training on large and diverse corpora. While this endows LLMs with strong generation and reasoning capabilities, it also amplifies the risks associated with sensitive, copyrighted, or harmful content in the training data. LLM unlearning, which aims to remove specific knowledge encoded within a model, is a promising technique for mitigating these risks. However, existing LLM unlearning methods often force LLMs to generate random or incoherent answers because they cannot precisely alter the encoded knowledge. To achieve effective unlearning at the knowledge level of LLMs, we propose Knowledge Unlearning by Deviating representAtion (KUDA). We first use causal tracing to locate the specific layers that store the target knowledge. We then design a new unlearning objective that, during knowledge removal, induces the model's representations to deviate from their original positions, thereby disrupting the model's ability to associate with the target knowledge. To resolve the optimization conflict between forgetting and retention, we employ a relaxation null-space projection mechanism that mitigates disruption to the representation space of retained knowledge. Extensive experiments on two representative benchmarks, WMDP and MUSE, demonstrate that KUDA outperforms most existing baselines by effectively balancing knowledge removal and model utility retention.
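The two ingredients described above — an objective that pushes a located layer's representation away from its original position, and a null-space projection that shields retained knowledge from the update — can be illustrated with a minimal numerical sketch. This is not the paper's implementation: the function names, the toy representation matrix `R`, the ridge-style relaxation term `lam`, and the step sizes are all illustrative assumptions.

```python
import numpy as np

def relaxed_nullspace_projector(R, lam=1e-6):
    """P = I - R^T (R R^T + lam*I)^{-1} R.
    With lam = 0 this is the exact projector onto the null space of R's
    rows (the retained-knowledge directions); lam > 0 is an assumed
    ridge-style relaxation that trades exactness for numerical stability."""
    k, d = R.shape
    return np.eye(d) - R.T @ np.linalg.solve(R @ R.T + lam * np.eye(k), R)

rng = np.random.default_rng(0)
d = 8
R = rng.normal(size=(3, d))        # stand-in retained-knowledge representations
P = relaxed_nullspace_projector(R)

h_orig = rng.normal(size=d)        # cached original representation of the target fact
delta0 = 0.01 * rng.normal(size=d)
h = h_orig + delta0                # small initial perturbation

# Knowledge-removal phase: gradient descent on the deviation objective
# L = -||h - h_orig||^2, with every update projected into the null space
# of the retained directions so they are left (nearly) untouched.
for _ in range(10):
    grad = -2.0 * (h - h_orig)     # dL/dh for the deviation objective
    h = h - 0.1 * (P @ grad)

# h now deviates from h_orig, while its footprint on the retained
# directions (the rows of R) is essentially unchanged.
```

The toy run shows the intended decoupling: the deviation grows along directions orthogonal to the retained representations, while the component visible to `R` stays fixed, which is how the projection resolves the forgetting-versus-retention conflict in this sketch.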