Neural language models (LMs) are vulnerable to training data extraction attacks due to data memorization. This paper introduces a novel attack scenario wherein an attacker adversarially fine-tunes pre-trained LMs to amplify the exposure of the original training data. Unlike prior work, this strategy aims to intensify the LM's retention of its pre-training dataset. To achieve this, the attacker needs to collect generated texts that are closely aligned with the pre-training data. However, without knowledge of the actual dataset, quantifying the amount of pre-training data within generated texts is challenging. To address this, we propose pseudo-labeling these generated texts, leveraging membership approximations indicated by machine-generated probabilities from the target LM. We subsequently fine-tune the LM to favor generations with higher likelihoods of originating from the pre-training data, based on their membership probabilities. Our empirical findings indicate a remarkable outcome: LMs with over 1B parameters exhibit a four- to eight-fold increase in training data exposure. We discuss potential mitigations and suggest future research directions.
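To make the attack loop concrete, the following is a minimal sketch, not the paper's exact pipeline: it samples text from a stand-in target LM (`gpt2`), pseudo-labels each sample with a crude membership proxy (here the LM's own average token log-likelihood, an assumption substituted for the paper's machine-generated-probability score), keeps the most member-like samples, and fine-tunes on them with the standard causal LM loss. All model names, thresholds, and hyperparameters are illustrative.

```python
# Hedged sketch of the self-amplification loop described in the abstract.
# Assumptions: gpt2 as the target LM, average log-likelihood as the membership
# proxy, and a simple "keep the top half" pseudo-labeling rule.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
name = "gpt2"  # stand-in; the paper's findings concern LMs with >1B parameters
tok = AutoTokenizer.from_pretrained(name)
tok.pad_token = tok.eos_token
lm = AutoModelForCausalLM.from_pretrained(name).to(device)

@torch.no_grad()
def membership_score(text: str) -> float:
    """Proxy for membership in the pre-training data: the target LM's mean
    token log-likelihood of the text (higher = more 'familiar' to the LM)."""
    ids = tok(text, return_tensors="pt").input_ids.to(device)
    out = lm(ids, labels=ids)
    return -out.loss.item()  # negative mean cross-entropy

# 1) Sample candidate generations from the target LM.
prompt = tok(tok.eos_token, return_tensors="pt").input_ids.to(device)
samples = lm.generate(prompt, do_sample=True, top_k=50, max_new_tokens=64,
                      num_return_sequences=16, pad_token_id=tok.eos_token_id)
texts = [tok.decode(s, skip_special_tokens=True) for s in samples]

# 2) Pseudo-label by the membership proxy; keep the most member-like half.
scored = sorted(texts, key=membership_score, reverse=True)
positives = scored[: len(scored) // 2]

# 3) Fine-tune on the pseudo-labeled positives, nudging the LM toward
#    generations it already assigns high likelihood to.
opt = AdamW(lm.parameters(), lr=1e-5)
lm.train()
for text in positives:
    batch = tok(text, return_tensors="pt", truncation=True).to(device)
    loss = lm(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```

In practice the sample-score-fine-tune loop would be repeated, and extraction would then be evaluated against known canaries or held-out pre-training text; this sketch only illustrates a single iteration under the stated assumptions.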