Background: Synthetic data has been proposed as a solution for sharing anonymized versions of sensitive biomedical datasets. Ideally, synthetic data should preserve the structure and statistical properties of the original data while protecting the privacy of the individual subjects. Differential privacy (DP) is currently considered the gold standard approach for balancing this trade-off. Objectives: To investigate the reliability of group differences identified by independent-sample tests on DP-synthetic data. The evaluation is conducted in terms of the tests' Type I and Type II errors: the former quantifies a test's validity, i.e., whether the probability of false discoveries is indeed below the significance level, and the latter indicates a test's power to make real discoveries. Methods: We evaluate the Mann-Whitney U test, Student's t-test, the chi-squared test, and the median test on DP-synthetic data. The private synthetic datasets are generated from real-world data, including a prostate cancer dataset (n=500) and a cardiovascular dataset (n=70,000), as well as from bivariate and multivariate simulated data. Five DP-synthetic data generation methods are evaluated, including two basic DP histogram release methods and the MWEM, Private-PGM, and DP GAN algorithms. Conclusion: A large portion of the evaluation results exhibited dramatically inflated Type I errors, especially at privacy budget levels of $\epsilon\leq 1$. This result calls for caution when releasing and analyzing DP-synthetic data: low p-values may be obtained in statistical tests simply as a byproduct of the noise added to protect privacy. A DP smoothed-histogram-based synthetic data generation method was shown to produce valid Type I errors at all privacy levels tested, but it required a large original dataset and a modest privacy budget ($\epsilon\geq 5$) to achieve reasonable Type II error.
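The core phenomenon, a true null hypothesis rejected because of privacy noise, can be illustrated with a minimal simulation. The sketch below is not the paper's exact experimental setup: it assumes a basic Laplace-mechanism DP histogram release (one of the simplest generators of the kind the abstract mentions), with bin edges, sample sizes, and $\epsilon$ chosen purely for illustration. Two groups are drawn from the same distribution, each is released as an independent noisy histogram, synthetic samples are drawn from the noisy histograms, and a Mann-Whitney U test is run on the synthetic samples; the fraction of rejections estimates the empirical Type I error, which exceeds the nominal level when the noise dominates the bin counts.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

def dp_histogram_sample(data, bins, epsilon, n_out, rng):
    """Release an epsilon-DP histogram via the Laplace mechanism and
    draw n_out synthetic points from it (uniform within each bin)."""
    counts, edges = np.histogram(data, bins=bins)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.size)
    probs = np.clip(noisy, 0.0, None)          # negative noisy counts -> 0
    probs = probs / probs.sum()                # renormalize to a distribution
    idx = rng.choice(counts.size, size=n_out, p=probs)
    return rng.uniform(edges[idx], edges[idx + 1])

alpha, n, epsilon, reps = 0.05, 200, 0.1, 500  # illustrative parameters only
bins = np.linspace(-4.0, 4.0, 21)

rejections = 0
for _ in range(reps):
    # Null hypothesis is true: both groups come from the same distribution.
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    syn_a = dp_histogram_sample(a, bins, epsilon, n, rng)
    syn_b = dp_histogram_sample(b, bins, epsilon, n, rng)
    if mannwhitneyu(syn_a, syn_b).pvalue < alpha:
        rejections += 1

print(f"empirical Type I error: {rejections / reps:.3f} (nominal {alpha})")
```

Because each group receives its own independent noise, the two noisy histograms encode slightly different distributions even under the null, and the test, which treats the synthetic points as ordinary i.i.d. samples, can mistake that noise for a real group difference.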