Membership Inference Attacks (MIAs) aim to distinguish training points (members) from unseen data (non-members), and are widely used to quantify memorization and assess privacy risks. Standard MIA evaluation requires repeated retraining, which is computationally costly for large models. One-run (single training with randomized data inclusion) and zero-run (post hoc evaluation) methods are often used instead, but their statistical validity remains unclear. We address this gap by framing MIA evaluation as a causal inference problem, defining \emph{memorization as the causal effect of including a data point in the training set}. This novel formulation reveals and formalizes key sources of bias in existing protocols: one-run methods suffer from interference between jointly included points, while zero-run evaluations are additionally confounded by distribution shift between member and non-member evaluation data. We derive causal analogues of standard MIA metrics and propose practical estimators for multi-run, one-run, and zero-run regimes with non-asymptotic consistency guarantees. We validate our approach in several settings, including pretrained and fine-tuned LLMs, showing that it enables reliable measurement of MIA performance without retraining and under distribution shift. Overall, our framework provides a principled foundation for privacy evaluation in modern AI systems.
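To make the causal reading of memorization concrete, the following is a minimal sketch of a multi-run, difference-in-means estimate of the effect of including a point on an attack score. It is an illustration under stated assumptions, not the estimator proposed in the paper: the function names `estimate_memorization` and `simulate_runs`, the toy Gaussian score model, and the independent 50% inclusion probability are all hypothetical. It also ignores the interference and distribution-shift issues discussed above.

```python
import random
from statistics import mean


def estimate_memorization(runs, point):
    """Difference-in-means estimate of the effect of including `point`.

    `runs` is a list of (included_points, scores) pairs, one per training run.
    `included_points` is the set of points in that run's training set, and
    `scores` maps each point to the attack score under that run's model.
    (Hypothetical data layout, chosen for illustration only.)
    """
    treated = [scores[point] for included, scores in runs if point in included]
    control = [scores[point] for included, scores in runs if point not in included]
    if not treated or not control:
        raise ValueError("point must be both included and excluded across runs")
    return mean(treated) - mean(control)


def simulate_runs(points, true_effect, n_runs=200, seed=0):
    """Toy stand-in for retraining: score = unit Gaussian noise,
    plus true_effect[p] when p is included in that run's training set."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        # Randomized inclusion: each point enters independently with prob. 0.5.
        included = {p for p in points if rng.random() < 0.5}
        scores = {
            p: rng.gauss(0.0, 1.0) + (true_effect[p] if p in included else 0.0)
            for p in points
        }
        runs.append((included, scores))
    return runs


points = ["a", "b"]
true_effect = {"a": 2.0, "b": 0.0}  # "a" is memorized in this toy model, "b" is not
runs = simulate_runs(points, true_effect)
estimates = {p: estimate_memorization(runs, p) for p in points}
```

In this toy setup the estimate for `"a"` should land near its simulated effect of 2.0 and the estimate for `"b"` near 0. Because inclusion is randomized independently of the scores, the simple difference in means is unbiased here; that property is exactly what one-run and zero-run protocols cannot take for granted.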