Large language models (LLMs) have been shown to pose social and ethical risks, such as generating toxic language or facilitating malicious use of hazardous knowledge. Machine unlearning is a promising approach to improving LLM safety by directly removing harmful behaviors and knowledge. In this paper, we propose "SPlit, UNlearn, MerGE" (SPUNGE), a framework that can be used with any unlearning method to amplify its effectiveness. SPUNGE leverages data attributes during unlearning by splitting unlearning data into subsets based on specific attribute values, unlearning each subset separately, and merging the unlearned models. We empirically demonstrate that SPUNGE significantly improves the performance of two recent unlearning methods on state-of-the-art LLMs while maintaining their general capabilities on standard academic benchmarks.
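The split/unlearn/merge pipeline described above can be sketched in a few lines. Everything below is a hypothetical illustration: the model representation (a list of weights), the `toy_unlearn` update, and the weight-averaging merge are placeholder assumptions, not the paper's actual unlearning or merging methods.

```python
# Hypothetical sketch of the SPUNGE pipeline; the concrete unlearn and
# merge procedures here are illustrative stand-ins, not the paper's.
from collections import defaultdict

def spunge(model, unlearn_data, attribute_of, unlearn, merge):
    # 1. SPlit: group unlearning examples by attribute value.
    subsets = defaultdict(list)
    for example in unlearn_data:
        subsets[attribute_of(example)].append(example)
    # 2. UNlearn: apply the base unlearning method to each subset separately.
    unlearned_models = [unlearn(model, subset) for subset in subsets.values()]
    # 3. MerGE: combine the per-subset unlearned models into one.
    return merge(unlearned_models)

# Toy instantiation: a "model" is a list of weights, unlearning subtracts a
# small per-example perturbation, and merging averages the weights.
def toy_unlearn(model, subset):
    return [w - 0.1 * len(subset) for w in model]

def toy_merge(models):
    return [sum(ws) / len(models) for ws in zip(*models)]

base_model = [1.0, 2.0]
data = [("toxicity", "ex1"), ("toxicity", "ex2"), ("hazard", "ex3")]
merged = spunge(base_model, data,
                attribute_of=lambda ex: ex[0],
                unlearn=toy_unlearn,
                merge=toy_merge)
```

Because each subset is unlearned independently, the per-attribute runs can be parallelized, and the merge step folds their effects back into a single model.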