The task of "unlearning" certain concepts in large language models (LLMs) has attracted considerable attention recently, due to its importance for mitigating undesirable model behaviors, such as the generation of harmful, private, or incorrect information. Current protocols for evaluating unlearning methods rely largely on behavioral tests, without monitoring the presence of unlearned knowledge within the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general methodology for eliciting directions in the parameter space (termed "concept vectors") that encode concrete concepts, and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods have minimal impact on concept vectors, while directly ablating these vectors demonstrably removes the associated knowledge from the LLMs and significantly reduces their susceptibility to adversarial manipulation. Our results highlight limitations in behavior-based unlearning evaluations and call for future work to include parameter-based evaluations. To support this, we release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.
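The idea of evaluating unlearning internally, by tracking how much a concept vector changes, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: a concept vector is represented as a plain list of floats, the function names (`cosine_similarity`, `l2_distance`, `ablate_with_noise`, `parametric_change`) are hypothetical and are not taken from the ConceptVectors repository, and the choice of cosine similarity and L2 distance as change metrics, as well as Gaussian noise as the ablation, is one plausible reading rather than the procedure the abstract specifies.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def ablate_with_noise(vector, noise_std=1.0, seed=0):
    """Hypothetical ablation: add Gaussian noise to every coordinate.

    This is an illustrative stand-in; the abstract does not say how
    concept vectors are ablated.
    """
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, noise_std) for x in vector]


def parametric_change(before, after):
    """Summarize how far a concept vector moved between two checkpoints."""
    return {
        "cosine_similarity": cosine_similarity(before, after),
        "l2_distance": l2_distance(before, after),
    }


# Toy data (made up): a concept vector before unlearning, after a method
# that barely touches the parameters, and after direct noise ablation.
before = [0.5, -1.0, 2.0, 0.25]
after_unlearning = [0.51, -0.99, 2.0, 0.25]
after_ablation = ablate_with_noise(before, noise_std=5.0, seed=0)

print("unlearning method:", parametric_change(before, after_unlearning))
print("direct ablation:  ", parametric_change(before, after_ablation))
```

A cosine similarity close to 1 together with a small L2 distance would indicate that the vector was left almost intact, which is the kind of residual trace a purely behavioral test would not reveal.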