Large Language Models are trained on extensive datasets that often contain sensitive, human-generated information, raising significant concerns about privacy breaches. While certified unlearning approaches offer strong privacy guarantees, they rely on restrictive model assumptions that do not apply to LLMs. As a result, various unlearning heuristics have been proposed, with their associated privacy risks assessed only empirically. Standard evaluation pipelines typically select data for removal at random from the training set, apply an unlearning technique, and use membership inference attacks (MIAs) to compare the unlearned model against a model retrained without the to-be-unlearned data. However, since every data point is subject to the right to be forgotten, unlearning should be assessed under the worst case from the privacy perspective. Prior work shows that data outliers may exhibit stronger memorization effects. Intuitively, such points are harder to unlearn, and thus the privacy risk of unlearning them is underestimated in current evaluations. In this paper, we leverage minority data to expose this critical flaw in previously widely adopted evaluations. We substantiate this claim through carefully designed experiments, including unlearning canaries related to minority groups, inspired by the privacy auditing literature. Using personally identifiable information as a representative minority identifier, we demonstrate that minority groups experience at least 20% more privacy leakage in most cases across six unlearning approaches, three MIAs, three benchmark datasets, and two LLMs of different scales. Given that the right to be forgotten should be upheld for every individual, we advocate for a more rigorous evaluation of LLM unlearning methods. Our minority-aware evaluation framework represents an initial step toward ensuring more equitable assessments of LLM unlearning efficacy.