Concept erasure in language models has traditionally lacked a comprehensive evaluation framework, leading to incomplete assessments of the effectiveness of erasure methods. We propose an evaluation paradigm centered on three critical criteria: innocence (complete knowledge removal), seamlessness (maintaining fluent conditional generation), and specificity (preserving unrelated task performance). Our evaluation metrics naturally motivate the development of Erasure of Language Memory (ELM), a new method designed to address all three dimensions. ELM employs targeted low-rank updates to alter output distributions for erased concepts while preserving overall model capabilities, including fluency when prompted for an erased concept. We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative analysis shows that ELM achieves superior performance across our proposed metrics, including near-random scores on erased-topic assessments, generation fluency, maintained accuracy on unrelated benchmarks, and robustness under adversarial attacks. Our code, data, and trained models are available at https://elm.baulab.info.
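The core mechanism referenced above, a targeted low-rank weight update, can be illustrated in isolation. The following is a generic sketch of a rank-r edit of a single weight matrix, not the authors' implementation; all dimensions and variable names here are illustrative assumptions:

```python
import numpy as np

# Generic low-rank weight edit: W' = W + B @ A, where B is (d_out x r) and
# A is (r x d_in) with r << min(d_out, d_in). Optimizing only B and A to
# change the model's output distribution on target prompts touches far
# fewer parameters than rewriting W directly, leaving W available to
# preserve behavior elsewhere.
d_out, d_in, r = 64, 32, 4  # illustrative sizes, not from the paper
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))       # frozen base weight
B = rng.standard_normal((d_out, r)) * 0.01   # learned low-rank factor
A = rng.standard_normal((r, d_in)) * 0.01    # learned low-rank factor

delta = B @ A                 # the edit itself has rank at most r
W_edited = W + delta          # applied weight after erasure training

assert np.linalg.matrix_rank(delta) <= r
assert W_edited.shape == W.shape
```

In practice the factors would be trained against an objective that pushes the edited model's distribution on erased-concept prompts toward a chosen target while regularizing behavior on unrelated inputs; this snippet only shows the parameterization.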