Machine unlearning is a promising approach to mitigate undesirable memorization of training data in LLMs. However, in this work we show that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of targeted relearning attacks. With access to only a small and potentially loosely related set of data, we find that we can "jog" the memory of unlearned models to reverse the effects of unlearning. For example, we show that relearning on public medical articles can lead an unlearned LLM to output harmful knowledge about bioweapons, and relearning general wiki information about the book series Harry Potter can force the model to output verbatim memorized text. We formalize this unlearning-relearning pipeline, explore the attack across three popular unlearning benchmarks, and discuss future directions and guidelines that result from our study.
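In concrete terms, the relearning attack described above is just ordinary fine-tuning of the unlearned checkpoint on a small auxiliary corpus, after which the model is queried for the supposedly forgotten content. The sketch below illustrates this with a standard HuggingFace-style training loop; the checkpoint name, relearn texts, and hyperparameters are illustrative placeholders, not the paper's actual configuration.

```python
# Minimal sketch of the unlearning-relearning pipeline, assuming a
# HuggingFace-style causal LM. "unlearned-model" is a hypothetical
# checkpoint produced by some unlearning method; the texts and
# hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "unlearned-model"  # hypothetical unlearned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Small, loosely related relearn set (e.g., public articles on the topic).
relearn_texts = [
    "Public article text related to the forgotten topic...",
    "Another loosely related document...",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # a few passes over the small corpus
    for text in relearn_texts:
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        # Standard causal-LM fine-tuning: next-token loss on the relearn text.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# After relearning, prompt the model for the "unlearned" knowledge
# and check whether it resurfaces.
```

The point of the sketch is that no privileged access is required: the attack uses the same fine-tuning interface available to any downstream user of the model.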