Membership inference attacks (MIAs) have emerged as the standard tool for evaluating the privacy risks of AI models. However, state-of-the-art attacks require training numerous, often computationally expensive, reference models, limiting their practicality. We present a novel approach for estimating model-level vulnerability to membership inference, the true-positive rate (TPR) at low false-positive rate (FPR), without requiring reference models. Empirical analysis shows loss distributions to be asymmetric and heavy-tailed, and suggests that most points at risk from MIAs have moved from the tail (high-loss region) to the head (low-loss region) of the distribution after training. We leverage this insight to propose a method that estimates model-level vulnerability from the training and testing loss distributions alone, using the absence of outliers in the high-loss region as a predictor of risk. We evaluate our method, the true-negative rate (TNR) of a simple loss attack, across a wide range of architectures and datasets, and show that it accurately estimates model-level vulnerability to the state-of-the-art MIA, LiRA. We also show that our method outperforms both low-cost attacks that use few reference models, such as RMIA, and other measures of distribution difference. Finally, we explore the use of non-linear functions to estimate risk and show the approach to be promising for evaluating the risk of large language models.
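To make the proposed metric concrete, the sketch below shows one way the TNR of a simple loss attack could be computed from member (train) and non-member (test) losses alone. This is an illustrative reconstruction, not the authors' implementation: the threshold convention (a quantile of the member losses) and the `quantile` value are assumptions, and the synthetic gamma-distributed losses exist only to make the example runnable.

```python
# Minimal sketch (assumed, not the paper's code) of a reference-model-free
# vulnerability estimate: the TNR of a simple loss-threshold attack.
import numpy as np


def loss_attack_tnr(train_losses, test_losses, quantile=0.10):
    """Estimate model-level vulnerability from loss distributions alone.

    A simple loss attack flags a point as a member when its loss falls
    below a threshold. Here the threshold is an assumed operating point:
    the `quantile`-th quantile of the member (train) losses. The TNR is
    the fraction of non-member (test) losses above that threshold, i.e.
    how few non-members look like low-loss "members".
    """
    threshold = np.quantile(np.asarray(train_losses), quantile)
    return float(np.mean(np.asarray(test_losses) > threshold))


# Usage with synthetic, heavy-tailed losses (illustrative only):
# members concentrate in the low-loss head, non-members sit further right.
rng = np.random.default_rng(0)
member_losses = rng.gamma(shape=1.5, scale=0.3, size=10_000)
nonmember_losses = rng.gamma(shape=2.5, scale=0.5, size=10_000)
print(f"TNR of simple loss attack: {loss_attack_tnr(member_losses, nonmember_losses):.3f}")
```

Under the abstract's claim, this TNR acts as a low-cost proxy for the TPR at low FPR that a strong reference-model attack such as LiRA would achieve: better separation between the two loss distributions yields a higher TNR and indicates a more vulnerable model.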