Synthetic data generation methods, and private synthetic data generation methods in particular, are gaining popularity as a way to produce copies of sensitive databases that can be shared widely for research and data analysis. A fundamental operation in data analysis is computing aggregate statistics, e.g., count, sum, or median, over the subset of data satisfying some conditions. Once synthetic data has been generated, users may want to know whether their aggregate queries can be answered reliably on it, for instance, to decide whether the synthetic data is suitable for a specific task. However, standard data generation systems provide no "per-query" quality guarantees on the synthetic data, so users cannot tell how far the aggregate statistics computed on it can be trusted. To address this problem, we present DP-PQD (differentially-private per-query decider), a novel framework that detects whether the answers to a query on the private and synthetic datasets are within a user-specified threshold of each other, while guaranteeing differential privacy. We give a suite of private per-query decider algorithms for count, sum, and median queries, analyze their properties, and evaluate them experimentally.
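To make the per-query decision concrete, the following is a minimal sketch of a decider for a count query. It is not one of the algorithms from this work, whose details the abstract does not give. It is a generic Laplace-mechanism baseline written under stated assumptions: the synthetic dataset is treated as fixed and public, neighboring private datasets differ by adding or removing one row (so the count distance has sensitivity 1), and the name `dp_count_decider` and its parameters are hypothetical.

```python
import random


def dp_count_decider(private_rows, synthetic_rows, predicate,
                     threshold, epsilon, rng=None):
    """Return True if the noisy count distance is below `threshold`.

    Illustrative baseline only (hypothetical helper, not the paper's method).
    Assumes the synthetic data is public and that neighboring private
    datasets differ in one row, so |count_private - count_synthetic|
    has sensitivity 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()

    # Exact counts of rows satisfying the query condition.
    count_private = sum(1 for row in private_rows if predicate(row))
    count_synthetic = sum(1 for row in synthetic_rows if predicate(row))

    # The difference of two i.i.d. Exponential(rate=epsilon) draws is a
    # Laplace(0, 1/epsilon) sample, i.e. Laplace noise for sensitivity 1.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)

    # Only this single bit is released; comparing a noisy value to a
    # data-independent threshold is post-processing of the noisy distance.
    noisy_distance = abs(count_private - count_synthetic) + noise
    return noisy_distance < threshold


if __name__ == "__main__":
    private_ages = [25, 30, 35, 40, 45]
    synthetic_ages = [28, 31, 36, 50]
    is_close = dp_count_decider(private_ages, synthetic_ages,
                                predicate=lambda age: age >= 30,
                                threshold=2, epsilon=1.0)
    print("synthetic count within threshold:", is_close)
```

With a small `epsilon`, the noise can flip the decision when the true distance is close to the threshold, which is the kind of accuracy question that motivates analyzing such deciders. Floating-point Laplace sampling as used here is also not hardened against known precision attacks, so the sketch is for illustration rather than deployment.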