Membership Inference Attacks (MIAs) aim to distinguish training points (members) from unseen data (non-members), and are widely used to quantify memorization and assess privacy risks. Standard MIA evaluation requires repeated retraining, which is computationally costly for large models. One-run (single training with randomized data inclusion) and zero-run (post hoc evaluation) methods are often used instead, but their statistical validity remains unclear. We address this gap by framing MIA evaluation as a causal inference problem, defining \emph{memorization as the causal effect of including a data point in the training set}. This novel formulation reveals and formalizes key sources of bias in existing protocols: one-run methods suffer from interference between jointly included points, while zero-run evaluations are additionally confounded by distribution shift between member and non-member evaluation data. We derive causal analogues of standard MIA metrics and propose practical estimators for multi-run, one-run, and zero-run regimes with non-asymptotic consistency guarantees. We validate our approach in several settings, including pretrained and fine-tuned LLMs, showing that it enables reliable measurement of MIA performance without retraining and under distribution shift. Overall, our framework provides a principled foundation for privacy evaluation in modern AI systems.
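
As a rough illustration of the causal definition stated in the abstract, memorization can be written in standard potential-outcomes notation. The symbols below are assumptions introduced here for illustration and are not taken from the paper: $A_i \in \{0,1\}$ indicates whether point $z_i$ is included in the training set, $S_i(a)$ is the attack score on $z_i$ that would be observed under inclusion status $a$, and the expectation is over training randomness.

\[
\tau_i \;=\; \mathbb{E}\bigl[\,S_i(1) - S_i(0)\,\bigr]
\]

Under the same assumptions, a multi-run protocol with $R$ independent trainings and randomized inclusion indicators $A_i^{(r)}$ would admit a simple difference-in-means estimate, with $R_a = \{\, r : A_i^{(r)} = a \,\}$:

\[
\hat{\tau}_i \;=\; \frac{1}{|R_1|} \sum_{r \in R_1} S_i^{(r)} \;-\; \frac{1}{|R_0|} \sum_{r \in R_0} S_i^{(r)}
\]

Writing the score as $S_i(a)$, a function of the point's own inclusion status alone, already presumes no interference between points. The abstract identifies that assumption as failing in the one-run regime, where the score on $z_i$ may also depend on which other points were jointly included. This is a sketch of the general idea only, not the estimators proposed in the paper.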