Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces

The task of "unlearning" certain concepts in large language models (LLMs) has recently attracted considerable attention, due to its importance in mitigating undesirable model behaviors, such as the generation of harmful, private, or incorrect information. Current protocols for evaluating unlearning methods rely largely on behavioral tests, without monitoring whether the unlearned knowledge is still present in the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general evaluation methodology that leverages vocabulary projections to inspect concepts encoded in model parameters. We use this approach to localize "concept vectors" (parameter vectors that encode concrete concepts) and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods minimally impact concept vectors and mostly suppress them during inference, while directly ablating these vectors demonstrably removes the associated knowledge and significantly reduces the model's susceptibility to adversarial manipulation. Our results highlight limitations in behavior-based unlearning evaluations and call for future work to include parameter-based evaluations. To support this, we release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.
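
As a rough illustration of the vocabulary-projection idea mentioned above, the following minimal sketch scores a parameter vector against every row of an unembedding matrix and reads off the top-scoring tokens. It is a toy under stated assumptions, not the benchmark's actual code: the vocabulary, the identity-like unembedding matrix, the hand-built `concept_vector`, and the helper names `project_to_vocab` and `cosine_similarity` are all hypothetical, and cosine similarity is only one plausible way to quantify how much a vector changed after unlearning.

```python
import numpy as np

def project_to_vocab(vector, unembedding, vocab, k=3):
    """Return the k tokens whose unembedding rows score highest against `vector`."""
    scores = unembedding @ vector            # one score per vocabulary token
    top = np.argsort(-scores)[:k]            # indices of the k largest scores
    return [vocab[i] for i in top]

def cosine_similarity(a, b):
    """Cosine similarity between two parameter vectors (assumed change measure)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy setup (all values are illustrative assumptions, not real model weights).
vocab = ["wand", "wizard", "spell", "tax", "river", "engine"]
unembedding = np.eye(len(vocab), 8)          # each token gets an orthogonal direction

# Hypothetical concept vector that leans toward the first three tokens.
concept_vector = 3 * unembedding[0] + 2 * unembedding[1] + 1 * unembedding[2]

top_tokens = project_to_vocab(concept_vector, unembedding, vocab)
print(top_tokens)

# A vector that an unlearning method barely touched stays close to the original.
slightly_edited = concept_vector + 0.01 * unembedding[3]
print(cosine_similarity(concept_vector, slightly_edited))
```

In this toy setting, a similarity close to 1 after unlearning would suggest that the parametric trace is largely intact even if the model's outputs no longer mention the concept.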
