Exploring the limits of strong membership inference attacks on large language models

Jamie Hayes, Ilia Shumailov, Christopher A. Choquette-Choo, Matthew Jagielski, George Kaissis, Milad Nasr, Sahra Ghalebikesabi, Meenatchi Sundaram Mutu Selva Annamalai, Niloofar Mireshghallah, Igor Shilov, Matthieu Meeus, Yves-Alexandre de Montjoye, Katherine Lee, Franziska Boenisch, Adam Dziedzic, A. Feder Cooper

From arXiv, NeurIPS 2025

State-of-the-art membership inference attacks (MIAs) typically require training many reference models, making it difficult to scale these attacks to large pre-trained language models (LLMs). As a result, prior research has either relied on weaker attacks that avoid training reference models (e.g., fine-tuning attacks), or on stronger attacks applied to small models and datasets. However, weaker attacks have been shown to be brittle, and insights from strong attacks in simplified settings do not translate to today's LLMs. These challenges prompt an important question: are the limitations observed in prior work due to attack design choices, or are MIAs fundamentally ineffective on LLMs? We address this question by scaling LiRA, one of the strongest MIAs, to GPT-2 architectures ranging from 10M to 1B parameters, training reference models on over 20B tokens from the C4 dataset. Our results advance the understanding of MIAs on LLMs in four key ways. While (1) strong MIAs can succeed on pre-trained LLMs, (2) their effectiveness remains limited (e.g., AUC < 0.7) in practical settings. (3) Even when strong MIAs achieve better-than-random AUC, aggregate metrics can conceal substantial per-sample MIA decision instability: due to training randomness, many decisions are so unstable that they are statistically indistinguishable from a coin flip. Finally, (4) the relationship between MIA success and related LLM privacy metrics is not as straightforward as prior work has suggested.
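
To make the role of reference models and the AUC metric concrete, here is a minimal sketch of a LiRA-style likelihood-ratio score. It is an illustration under stated assumptions, not the authors' implementation: the function names (`lira_scores`, `auc`), the per-sample Gaussian fit, and the use of a generic scalar statistic per sample (for example, a negative loss) are all assumptions chosen for brevity, and the demo data is synthetic.

```python
import math
import numpy as np


def gaussian_logpdf(x, mean, std):
    """Elementwise log-density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2


def lira_scores(target, ref_in, ref_out, eps=1e-6):
    """Per-sample log-likelihood ratio (hypothetical helper).

    target:  shape (n,), statistic of each sample under the target model
    ref_in:  shape (k_in, n), same statistic under reference models trained WITH the sample
    ref_out: shape (k_out, n), same statistic under reference models trained WITHOUT it
    Assumption: each sample's IN and OUT statistics are modelled as Gaussians.
    """
    mu_in, sd_in = ref_in.mean(axis=0), ref_in.std(axis=0) + eps
    mu_out, sd_out = ref_out.mean(axis=0), ref_out.std(axis=0) + eps
    # Higher score means the observation looks more like a training member.
    return gaussian_logpdf(target, mu_in, sd_in) - gaussian_logpdf(target, mu_out, sd_out)


def auc(scores, is_member):
    """Probability that a random member outscores a random non-member (ties count half)."""
    pos, neg = scores[is_member], scores[~is_member]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)


if __name__ == "__main__":
    # Synthetic demo: members get a slightly higher statistic than non-members.
    rng = np.random.default_rng(0)
    n = 500
    is_member = rng.random(n) < 0.5
    ref_in = rng.normal(1.0, 1.0, size=(16, n))
    ref_out = rng.normal(0.0, 1.0, size=(16, n))
    target = np.where(is_member, rng.normal(1.0, 1.0, n), rng.normal(0.0, 1.0, n))
    print("AUC:", auc(lira_scores(target, ref_in, ref_out), is_member))
```

Repeating the demo with different random seeds gives a rough feel for how an individual sample's decision can flip from run to run even when the aggregate AUC stays above 0.5.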
