Membership Inference Attacks (MIAs) aim to distinguish training points (members) from unseen data (non-members), and are widely used to quantify memorization and assess privacy risks. Standard MIA evaluation requires repeated retraining, which is computationally costly for large models. One-run (single training with randomized data inclusion) and zero-run (post hoc evaluation) methods are often used instead, but their statistical validity remains unclear. We address this gap by framing MIA evaluation as a causal inference problem, defining \emph{memorization as the causal effect of including a data point in the training set}. This novel formulation reveals and formalizes key sources of bias in existing protocols: one-run methods suffer from interference between jointly included points, while zero-run evaluations are additionally confounded by distribution shift between member and non-member evaluation data. We derive causal analogues of standard MIA metrics and propose practical estimators for multi-run, one-run, and zero-run regimes with non-asymptotic consistency guarantees. We validate our approach in several settings, including pretrained and fine-tuned LLMs, showing that it enables reliable measurement of MIA performance without retraining and under distribution shift. Overall, our framework provides a principled foundation for privacy evaluation in modern AI systems.
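As a rough illustration of the causal definition above, one could write memorization in potential-outcomes notation. The symbols here are assumptions introduced for this sketch and are not taken from the paper: $W_z \in \{0,1\}$ indicates whether point $z$ is included in the training set, $S_z(w)$ is the attack score on $z$ when the model is trained with $W_z = w$, and expectations are over training randomness and the inclusion of the remaining data.

\begin{equation}
\tau(z) \;=\; \mathbb{E}\big[S_z(1)\big] - \mathbb{E}\big[S_z(0)\big]
\end{equation}

Under this notation, if inclusion is randomized independently across repeated training runs (the multi-run regime), the effect would be identified by a simple difference of conditional means:

\begin{equation}
\tau(z) \;=\; \mathbb{E}\big[S_z \mid W_z = 1\big] - \mathbb{E}\big[S_z \mid W_z = 0\big]
\end{equation}

With a single run, the observed score for $z$ also depends on which other points were jointly included, and in the zero-run case members and non-members may come from different distributions, so this identity would not be expected to hold without further correction.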