In this paper we develop state-of-the-art privacy attacks against Large Language Models (LLMs), where an adversary with some access to the model tries to learn something about the underlying training data. Our headline results are new membership inference attacks (MIAs) against pretrained LLMs that perform hundreds of times better than baseline attacks, and a pipeline showing that over 50% (!) of the fine-tuning dataset can be extracted from a fine-tuned LLM in natural settings. We consider varying degrees of access to the underlying model, pretraining and fine-tuning data, and both MIAs and training data extraction. For pretraining data, we propose two new MIAs: a supervised neural network classifier that predicts training data membership on the basis of (dimensionality-reduced) model gradients, as well as a variant of this attack that only requires logit access to the model by leveraging recent model-stealing work on LLMs. To our knowledge this is the first MIA that explicitly incorporates model-stealing information. Both attacks outperform existing black-box baselines, and our supervised attack closes the gap between MIA success against LLMs and the strongest known attacks for other machine learning models. In the fine-tuning setting, we find that a simple attack based on the ratio of the loss between the base and fine-tuned models is able to achieve near-perfect MIA performance; we then leverage our MIA to extract a large fraction of the fine-tuning dataset from fine-tuned Pythia and Llama models. Our code is available at github.com/safr-ai-lab/pandora-llm.
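The fine-tuning MIA described above scores each candidate example by how much the fine-tuned model's loss drops relative to the base model's. A minimal sketch of this statistic, assuming `base_loss` and `finetuned_loss` are per-example cross-entropy losses already computed by evaluating both models on the candidate text (the function names here are illustrative, not the paper's API):

```python
def loss_ratio_score(finetuned_loss: float, base_loss: float) -> float:
    """Membership score for the base/fine-tuned loss-ratio attack.

    Cross-entropy loss is a negative log-likelihood, so
    base_loss - finetuned_loss = log(p_finetuned / p_base),
    i.e. the log of the likelihood ratio between the two models.
    Larger scores suggest the example was in the fine-tuning set.
    """
    return base_loss - finetuned_loss


def predict_member(finetuned_loss: float, base_loss: float,
                   threshold: float = 0.0) -> bool:
    """Flag an example as a fine-tuning-set member if its loss-ratio
    score exceeds a threshold chosen on held-out data."""
    return loss_ratio_score(finetuned_loss, base_loss) > threshold
```

In practice the two losses would come from a forward pass of each model on the same text; examples memorized during fine-tuning show a sharply lower fine-tuned loss than base loss, which is what makes this simple statistic so effective.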