Statistical data anonymization increasingly relies on fully synthetic microdata, for which classical identity disclosure measures are less informative than measures of an adversary's ability to infer sensitive attributes from the released data. We introduce RAPID (Risk of Attribute Prediction-Induced Disclosure), a disclosure risk measure that directly quantifies inferential vulnerability under a realistic attack model. An adversary trains a predictive model solely on the released synthetic data and applies it to real individuals' quasi-identifiers. For continuous sensitive attributes, RAPID reports the proportion of records whose predicted values fall within a specified relative error tolerance. For categorical attributes, we propose a baseline-normalized confidence score that measures how much more confident the attacker is about the true class than would be expected from class prevalence alone, and we summarize risk as the fraction of records exceeding a policy-defined threshold. This construction yields an interpretable, bounded risk metric that is robust to class imbalance, independent of any specific synthesizer, and applicable with arbitrary learning algorithms. We illustrate threshold calibration, uncertainty quantification, and comparative evaluation of synthetic data generators using simulations and real data. Our results show that RAPID provides a practical, attacker-realistic upper bound on attribute-inference disclosure risk that complements existing utility diagnostics and disclosure control frameworks.
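The two risk summaries described above can be sketched in code. The sketch below is illustrative only: the function names, the floor on the denominator for the relative error, and the particular normalization `(p - q) / (1 - q)` for the categorical confidence score are assumptions, not the paper's exact formulas. It takes an attacker model's outputs (point predictions for a continuous attribute, class probabilities for a categorical one) and returns the fraction of real records at risk.

```python
import numpy as np

def rapid_continuous(y_true, y_pred, tol=0.10):
    """Hypothetical continuous-attribute RAPID: fraction of real records
    whose attacker-predicted value lies within a relative error tolerance
    `tol` of the true sensitive value."""
    # Floor the denominator to avoid division by zero (an assumption;
    # the paper may handle near-zero true values differently).
    rel_err = np.abs(y_pred - y_true) / np.maximum(np.abs(y_true), 1e-12)
    return float(np.mean(rel_err <= tol))

def rapid_categorical(true_labels, proba, prevalence, threshold=0.2):
    """Hypothetical categorical-attribute RAPID: baseline-normalized
    confidence in the true class, summarized as the fraction of records
    whose score exceeds a policy-defined threshold.

    One plausible normalization (an assumption): (p - q) / (1 - q),
    where p is the attacker's predicted probability of the true class
    and q is that class's prevalence. The score is 0 when the attacker
    is no more confident than prevalence alone, 1 at full confidence."""
    p = proba[np.arange(len(true_labels)), true_labels]
    q = prevalence[true_labels]
    score = (p - q) / (1.0 - q)
    return float(np.mean(score > threshold))

# Toy usage: predictions would come from a model trained only on synthetic data.
risk_cont = rapid_continuous(
    y_true=np.array([100.0, 200.0]),
    y_pred=np.array([105.0, 260.0]),  # relative errors 0.05 and 0.30
    tol=0.10,
)

risk_cat = rapid_categorical(
    true_labels=np.array([0, 1]),
    proba=np.array([[0.9, 0.1], [0.4, 0.6]]),
    prevalence=np.array([0.5, 0.5]),  # class prevalence, e.g. from the synthetic data
    threshold=0.2,
)
```

Both summaries are bounded in [0, 1] and read as "fraction of real individuals the attacker pins down," which is what makes the metric interpretable across synthesizers and learning algorithms.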