Synthetic data generation is a promising technique for enabling the use of sensitive data while mitigating the risk of privacy breaches. However, for synthetic data to be useful in downstream analysis tasks, it needs to be of sufficient quality. Various methods have been proposed to measure the utility of synthetic data, but their results are often incomplete or even misleading. In this paper, we propose using density ratio estimation to improve quality evaluation for synthetic data, and thereby the quality of synthesized datasets. We show how this framework relates to and builds on existing measures, yielding global and local utility measures that are informative and easy to interpret. We develop an estimator that requires little to no manual tuning because the nonparametric density ratio model is selected automatically. Through simulations, we find that density ratio estimation yields more accurate estimates of global utility than established procedures. A real-world data application demonstrates how the density ratio can guide refinements of synthesis models and can be used to improve downstream analyses. We conclude that density ratio estimation is a valuable tool in synthetic data generation workflows, and we provide these methods in the accessible open-source R package densityratio.
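To make the idea of a density-ratio-based global utility measure concrete, the following is a minimal NumPy sketch, not the densityratio package's API and not necessarily the authors' estimator. It assumes the ratio is defined as observed density over synthetic density, and it uses unconstrained least-squares importance fitting (uLSIF) with a Gaussian kernel as one well-known nonparametric estimator. The bandwidth `sigma`, the ridge penalty `lam`, the number of kernel centers, and all function names are hypothetical choices fixed by hand here, whereas the abstract describes selecting the model automatically.

```python
import numpy as np


def gaussian_kernel(x, centers, sigma):
    """Gaussian kernel matrix between rows of x (n, d) and centers (m, d)."""
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def fit_ulsif(x_nu, x_de, sigma=1.0, lam=0.1, n_centers=100, seed=0):
    """Fit r(x) ~ p_nu(x) / p_de(x) with uLSIF (hand-picked sigma and lam)."""
    rng = np.random.default_rng(seed)
    n_centers = min(n_centers, len(x_nu))
    # Kernel centers are a random subset of the numerator sample.
    centers = x_nu[rng.choice(len(x_nu), size=n_centers, replace=False)]
    phi_nu = gaussian_kernel(x_nu, centers, sigma)
    phi_de = gaussian_kernel(x_de, centers, sigma)
    H = phi_de.T @ phi_de / len(x_de)   # second moment under the denominator
    h = phi_nu.mean(axis=0)             # first moment under the numerator
    # Ridge-regularized least-squares solution, then clip to keep r(x) >= 0.
    alpha = np.linalg.solve(H + lam * np.eye(n_centers), h)
    alpha = np.maximum(alpha, 0.0)

    def ratio(x):
        return gaussian_kernel(x, centers, sigma) @ alpha

    return ratio


def pearson_divergence(ratio, x_nu, x_de):
    """Plug-in Pearson divergence estimate; larger means lower utility."""
    return ratio(x_nu).mean() - 0.5 * (ratio(x_de) ** 2).mean() - 0.5


# Toy data (illustrative only): one faithful and one poor synthetic sample.
rng = np.random.default_rng(42)
observed = rng.normal(size=(500, 2))
syn_good = rng.normal(size=(500, 2))           # same distribution as observed
syn_bad = rng.normal(loc=2.0, size=(500, 2))   # mean-shifted synthesizer

ratio_good = fit_ulsif(observed, syn_good)
ratio_bad = fit_ulsif(observed, syn_bad)
pe_good = pearson_divergence(ratio_good, observed, syn_good)
pe_bad = pearson_divergence(ratio_bad, observed, syn_bad)
print(f"Pearson divergence, faithful synthesizer: {pe_good:.3f}")
print(f"Pearson divergence, shifted synthesizer:  {pe_bad:.3f}")
```

The single divergence number plays the role of a global utility measure, while evaluating `ratio_bad` at individual records gives a local view of where the synthetic data under- or over-represent the observed data. In practice the tuning parameters would be chosen by cross-validation rather than fixed as in this sketch.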