Neural language models (LMs) are vulnerable to training data extraction attacks because they memorize their training data. This paper introduces a novel attack scenario in which an attacker adversarially fine-tunes a pre-trained LM to amplify the exposure of its original training data. Unlike prior studies, this strategy aims to intensify the LM's retention of its pre-training dataset. To achieve this, the attacker needs to collect generated texts that closely align with the pre-training data. However, without knowledge of the actual dataset, quantifying how much pre-training data the generated texts contain is challenging. To address this, we propose pseudo-labeling these generated texts with membership approximations indicated by machine-generated probabilities from the target LM. We then fine-tune the LM to favor generations that are more likely to originate from the pre-training data, according to their membership probabilities. Empirically, LMs with over 1B parameters exhibit a four- to eight-fold increase in training data exposure. We discuss potential mitigations and suggest future research directions.
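The pseudo-labeling step described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: the membership pseudo-label is taken to be one minus a machine-generated probability, and generations are turned into (chosen, rejected) preference pairs for a later fine-tuning stage. The names `Generation`, `pseudo_label`, `build_preference_pairs`, and the `margin` parameter are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Generation:
    text: str
    membership: float  # pseudo-label in [0, 1]; higher = more likely pre-training data


def pseudo_label(text: str, p_machine: float) -> Generation:
    """Assumption: a text that looks less machine-generated is treated as
    more likely to come from the pre-training data."""
    if not 0.0 <= p_machine <= 1.0:
        raise ValueError("p_machine must be a probability in [0, 1]")
    return Generation(text=text, membership=1.0 - p_machine)


def build_preference_pairs(gens, margin=0.1):
    """Return (chosen, rejected) text pairs, where 'chosen' has the higher
    pseudo-labeled membership. Pairs whose membership gap is below `margin`
    are skipped because the pseudo-labels are too close to rank reliably."""
    pairs = []
    for a, b in combinations(gens, 2):
        if abs(a.membership - b.membership) < margin:
            continue
        chosen, rejected = (a, b) if a.membership > b.membership else (b, a)
        pairs.append((chosen.text, rejected.text))
    return pairs


# Hypothetical detector outputs for three generated texts.
gens = [
    pseudo_label("a", 0.25),
    pseudo_label("b", 0.75),
    pseudo_label("c", 0.5),
]
pairs = build_preference_pairs(gens, margin=0.3)
```

Pairs of this kind could feed a preference-based fine-tuning procedure that pushes the LM toward generations with higher membership pseudo-labels. The abstract does not state which fine-tuning procedure is used.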