Previous work has shown that Large Language Models are susceptible to so-called data extraction attacks. These attacks allow an attacker to extract samples that were contained in the training data, which has serious privacy implications. Constructing data extraction attacks is challenging: current attacks are quite inefficient, and there is a significant gap between the extraction capabilities of untargeted attacks and the extent of memorization. Targeted attacks have therefore been proposed, which identify whether a given sample from the training data is extractable from a model. In this work, we apply a targeted data extraction attack to the SATML2023 Language Model Training Data Extraction Challenge. We use a two-step approach. In the first step, we maximise the recall of the model and are able to extract the suffix for 69% of the samples. In the second step, we use a classifier-based Membership Inference Attack on the generations. Our AutoSklearn classifier achieves a precision of 0.841. The full approach reaches a score of 0.405 recall at a 10% false positive rate, which is an improvement of 34% over the baseline of 0.301.
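To make the reported metric concrete, the following is a minimal sketch of how recall at a fixed false positive rate can be computed from per-sample confidence scores, together with the relative improvement over a baseline. The function names `recall_at_fpr` and `relative_improvement`, the label convention (1 for a correctly extracted suffix, 0 otherwise), and the toy data are assumptions made for illustration; the challenge's official scoring script may differ in its details.

```python
def recall_at_fpr(scores, labels, max_fpr=0.10):
    """Best recall reachable by thresholding `scores` with FPR <= max_fpr.

    scores: confidence that a generated suffix is a true extraction
            (hypothetical classifier output).
    labels: 1 if the generation matches the training suffix, else 0.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    # Rank generations from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    tp = fp = 0
    best_recall = 0.0
    i = 0
    while i < len(ranked):
        # Accept every sample tied at this score: a threshold cannot split ties.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / negatives > max_fpr:
            break  # lowering the threshold further exceeds the FPR budget
        best_recall = tp / positives
        i = j
    return best_recall


def relative_improvement(score, baseline):
    """Improvement of `score` over `baseline`, as a fraction of the baseline."""
    return (score - baseline) / baseline


# Toy data (invented for illustration): 4 true extractions, 10 false ones.
toy_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5,
              0.4, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05]
toy_labels = [1, 1, 0, 1, 0, 1,
              0, 0, 0, 0, 0, 0, 0, 0]

toy_recall = recall_at_fpr(toy_scores, toy_labels, max_fpr=0.10)
gain = relative_improvement(0.405, 0.301)
```

With the figures quoted above, `relative_improvement(0.405, 0.301)` evaluates to roughly 0.346, which corresponds to the stated improvement of about 34% over the baseline.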