This study investigates Machine Unlearning (MU), an emerging field that addresses the concern that neural models may inadvertently retain personal or sensitive data. A novel approach is introduced for precise, selective forgetting in language models. Unlike previous methods, which adopt completely opposing (reversed) training objectives, this approach aims to limit the damage to language model performance, particularly on generation tasks. Two new evaluation metrics are also proposed, Sensitive Information Extraction Likelihood (S-EL) and Sensitive Information Memory Accuracy (S-MA), to measure how effectively sensitive information is eliminated. To support the forgetting framework, an effective method for annotating sensitive scopes is presented that combines online and offline strategies. The online selection mechanism relies on language model probability scores to remain computationally efficient, while the offline annotation uses a robust two-stage process based on Large Language Models (LLMs).
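To make the online selection and the S-MA metric more concrete, the following is a minimal sketch under stated assumptions, not the method described in this study. It assumes that (1) the online selection flags tokens whose language model probability falls below a fixed threshold, and (2) S-MA is a token-level memorization accuracy computed only over the selected sensitive positions, i.e., the fraction of those positions where the model's greedy next-token prediction reproduces the original token. The selection direction, the threshold, and the function names `select_sensitive_positions` and `sensitive_memory_accuracy` are hypothetical; the abstract does not specify them.

```python
from typing import List, Sequence


def select_sensitive_positions(token_probs: Sequence[float], threshold: float) -> List[int]:
    """Hypothetical online selection of a sensitive scope.

    ASSUMPTION: a token is treated as sensitive when the language model
    assigns it a probability below `threshold`. The abstract only states
    that probability scores are used, not this exact rule.
    """
    return [i for i, p in enumerate(token_probs) if p < threshold]


def sensitive_memory_accuracy(
    predicted_ids: Sequence[int],
    target_ids: Sequence[int],
    sensitive_positions: Sequence[int],
) -> float:
    """Hypothetical S-MA: memorization accuracy restricted to sensitive positions.

    predicted_ids[i] is the model's greedy (argmax) prediction for position i
    given the preceding tokens; target_ids[i] is the original token there.
    """
    if len(predicted_ids) != len(target_ids):
        raise ValueError("predicted_ids and target_ids must have the same length")
    if not sensitive_positions:
        raise ValueError("no sensitive positions to evaluate")
    # Count sensitive positions where the model still reproduces the original token.
    hits = sum(predicted_ids[i] == target_ids[i] for i in sensitive_positions)
    return hits / len(sensitive_positions)


if __name__ == "__main__":
    # Toy per-token probabilities from a language model (illustrative values only).
    probs = [0.9, 0.05, 0.8, 0.02]
    positions = select_sensitive_positions(probs, threshold=0.1)
    print(positions)  # [1, 3]
    # Greedy predictions vs. original tokens: position 1 differs, position 3 matches.
    print(sensitive_memory_accuracy([5, 7, 2, 9], [5, 8, 2, 9], positions))  # 0.5
```

Under these assumptions, a lower S-MA after unlearning would indicate that the model reproduces fewer of the sensitive tokens. The positions passed to the metric could just as well come from the offline LLM-based annotation instead of the threshold rule.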