Concept erasure in language models has traditionally lacked a comprehensive evaluation framework, leading to incomplete assessments of the effectiveness of erasure methods. We propose an evaluation paradigm centered on three critical criteria: innocence (complete knowledge removal), seamlessness (maintaining fluent conditional generation), and specificity (preserving unrelated task performance). Our evaluation metrics naturally motivate the development of Erasure of Language Memory (ELM), a new method designed to address all three dimensions. ELM employs targeted low-rank updates to alter output distributions for erased concepts while preserving overall model capabilities, including fluency when prompted for an erased concept. We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative analysis shows that ELM achieves superior performance across our proposed metrics, including near-random scores on erased-topic assessments, generation fluency, maintained accuracy on unrelated benchmarks, and robustness under adversarial attacks. Our code, data, and trained models are available at https://elm.baulab.info.
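The core mechanism referenced above, a targeted low-rank weight update, can be illustrated in isolation. The following is a generic sketch of a rank-r edit of a single weight matrix, not the authors' implementation; all dimensions and variable names here are illustrative assumptions:

```python
import numpy as np

# Generic low-rank weight edit: W' = W + B @ A, where B is (d_out x r) and
# A is (r x d_in) with r << min(d_out, d_in). Optimizing only B and A to
# change the model's output distribution on target prompts touches far
# fewer parameters than rewriting W directly, leaving W available to
# preserve behavior elsewhere.
d_out, d_in, r = 64, 32, 4  # illustrative sizes, not from the paper
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))       # frozen base weight
B = rng.standard_normal((d_out, r)) * 0.01   # learned low-rank factor
A = rng.standard_normal((r, d_in)) * 0.01    # learned low-rank factor

delta = B @ A                 # the edit itself has rank at most r
W_edited = W + delta          # applied weight after erasure training

assert np.linalg.matrix_rank(delta) <= r
assert W_edited.shape == W.shape
```

In practice the factors would be trained against an objective that pushes the edited model's distribution on erased-concept prompts toward a chosen target while regularizing behavior on unrelated inputs; this snippet only shows the parameterization.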