This paper describes an extension of the synthpop package for R that adds functions for calculating identity and attribute disclosure risk for synthetic data, where risk is assessed for the records used to create the synthetic data. The basic function, disclosure, calculates identity disclosure for a set of quasi-identifiers (keys) and attribute disclosure for one variable, specified as a target, from the same set of keys. The second function, disclosure.summary, is a wrapper for the first and presents summary results for a set of targets. This short paper explains the measures of disclosure risk and documents how they are calculated. We recommend two measures: $RepU$ (replicated uniques) for identity disclosure and $DiSCO$ (Disclosive in Synthetic Correct Original) for attribute disclosure. Both are expressed as a \% of the original records, and each can be compared to similar measures calculated from the original data. Experience with using the functions on real data showed that some apparent disclosures arise from relationships in the data that anyone familiar with its features would be expected to know. We flag cases where this seems to have occurred and provide a means of excluding them.
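To make the two recommended measures concrete, the following is a minimal sketch in plain Python, not the synthpop implementation. It assumes one reading of the names: $RepU$ counts original records whose key combination is unique in the original data and also unique in the synthetic data, and $DiSCO$ counts original records whose key combination appears in the synthetic data with a single target value that matches the record's own target value. The function names `rep_u` and `disco`, the toy records, and the variable names are hypothetical, and the package functions may treat missing values and other special cases differently.

```python
from collections import Counter, defaultdict


def _combo(row, keys):
    """Return the tuple of key values (quasi-identifiers) for one record."""
    return tuple(row[k] for k in keys)


def rep_u(original, synthetic, keys):
    """Percent of original records that are unique on the keys in the
    original data and whose key combination is also unique in the
    synthetic data (assumed reading of 'replicated uniques')."""
    orig_counts = Counter(_combo(r, keys) for r in original)
    syn_counts = Counter(_combo(r, keys) for r in synthetic)
    hits = sum(
        1 for q, n in orig_counts.items()
        if n == 1 and syn_counts.get(q) == 1
    )
    return 100.0 * hits / len(original)


def disco(original, synthetic, keys, target):
    """Percent of original records whose key combination has exactly one
    target value in the synthetic data (disclosive in synthetic) and that
    value equals the record's own target (correct in original)."""
    syn_targets = defaultdict(set)
    for r in synthetic:
        syn_targets[_combo(r, keys)].add(r[target])
    hits = 0
    for r in original:
        values = syn_targets.get(_combo(r, keys))
        if values is not None and len(values) == 1 and r[target] in values:
            hits += 1
    return 100.0 * hits / len(original)


# Toy data, invented for illustration only.
original = [
    {"age": "30-39", "sex": "F", "income": "high"},
    {"age": "30-39", "sex": "M", "income": "low"},
    {"age": "40-49", "sex": "F", "income": "low"},
    {"age": "40-49", "sex": "F", "income": "high"},
]
synthetic = [
    {"age": "30-39", "sex": "F", "income": "high"},
    {"age": "30-39", "sex": "M", "income": "high"},
    {"age": "30-39", "sex": "M", "income": "high"},
    {"age": "40-49", "sex": "F", "income": "low"},
]
keys = ["age", "sex"]

repu_pct = rep_u(original, synthetic, keys)
disco_pct = disco(original, synthetic, keys, "income")
print(f"RepU  = {repu_pct:.1f}% of original records")
print(f"DiSCO = {disco_pct:.1f}% of original records")
```

In this toy case only the record with keys (30-39, F) is unique in both data sets, so the sketch reports a $RepU$ of 25\%. Two of the four original records have a key combination whose single synthetic target value matches their own, giving a $DiSCO$ of 50\%.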