In this paper we develop state-of-the-art privacy attacks against Large Language Models (LLMs), where an adversary with some access to the model tries to learn something about the underlying training data. Our headline results are new membership inference attacks (MIAs) against pretrained LLMs that perform hundreds of times better than baseline attacks, and a pipeline showing that over 50% (!) of the fine-tuning dataset can be extracted from a fine-tuned LLM in natural settings. We consider varying degrees of access to the underlying model, pretraining and fine-tuning data, and both MIAs and training data extraction. For pretraining data, we propose two new MIAs: a supervised neural network classifier that predicts training data membership on the basis of (dimensionality-reduced) model gradients, as well as a variant of this attack that only requires logit access to the model by leveraging recent model-stealing work on LLMs. To our knowledge this is the first MIA that explicitly incorporates model-stealing information. Both attacks outperform existing black-box baselines, and our supervised attack closes the gap between MIA success against LLMs and the strongest known attacks for other machine learning models. In the fine-tuning setting, we find that a simple attack based on the ratio of the loss between the base and fine-tuned models is able to achieve near-perfect MIA performance; we then leverage our MIA to extract a large fraction of the fine-tuning dataset from fine-tuned Pythia and Llama models. Our code is available at github.com/safr-ai-lab/pandora-llm.
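The fine-tuning MIA described above scores each candidate example by how much the fine-tuned model's loss drops relative to the base model's. A minimal sketch of this statistic, assuming `base_loss` and `finetuned_loss` are per-example cross-entropy losses already computed by evaluating both models on the candidate text (the function names here are illustrative, not the paper's API):

```python
def loss_ratio_score(finetuned_loss: float, base_loss: float) -> float:
    """Membership score for the base/fine-tuned loss-ratio attack.

    Cross-entropy loss is a negative log-likelihood, so
    base_loss - finetuned_loss = log(p_finetuned / p_base),
    i.e. the log of the likelihood ratio between the two models.
    Larger scores suggest the example was in the fine-tuning set.
    """
    return base_loss - finetuned_loss


def predict_member(finetuned_loss: float, base_loss: float,
                   threshold: float = 0.0) -> bool:
    """Flag an example as a fine-tuning-set member if its loss-ratio
    score exceeds a threshold chosen on held-out data."""
    return loss_ratio_score(finetuned_loss, base_loss) > threshold
```

In practice the two losses would come from a forward pass of each model on the same text; examples memorized during fine-tuning show a sharply lower fine-tuned loss than base loss, which is what makes this simple statistic so effective.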