We study synthetic data release for answering multiple linear queries over a set of database tables in a differentially private way. Two special cases have been considered in the literature: releasing a synthetic dataset for answering multiple linear queries over a single table, and releasing the answer to a single counting (join size) query over a set of database tables. Compared to the single-table case, the join operator makes query answering challenging, since the sensitivity (i.e., how much an individual data record can affect the answer) can be heavily amplified by complex join relationships. We present an algorithm for the general problem and prove a lower bound showing that this algorithm achieves parameterized optimality (up to logarithmic factors) on some simple queries (e.g., two-table join queries) in the most commonly used privacy parameter regimes. For the case of hierarchical joins, we present a data partition procedure that exploits the concept of {\em uniformized sensitivities} to further improve the utility.
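To make the sensitivity amplification concrete, the following is a minimal sketch of a two-table join size query, not the algorithm of this work. The relation layout `r1(A, B)`, `r2(B, C)` and the helper names `join_size` and `local_sensitivity` are assumptions introduced only for illustration. It shows that removing a single record can change the join size by the degree of its join value in the other table, whereas a single-table count changes by at most 1.

```python
from collections import Counter

def join_size(r1, r2):
    """Size of the join of r1(A, B) and r2(B, C) on attribute B."""
    deg1 = Counter(b for _, b in r1)   # how often each B value occurs in r1
    deg2 = Counter(b for b, _ in r2)   # how often each B value occurs in r2
    return sum(deg1[b] * deg2[b] for b in deg1)

def local_sensitivity(r1, r2):
    """Largest change in join size from adding or removing one record
    in either table (hypothetical helper for this toy setting)."""
    deg1 = Counter(b for _, b in r1)
    deg2 = Counter(b for b, _ in r2)
    # A record in r1 with join value b contributes deg2[b] join results,
    # and symmetrically for a record in r2.
    return max(max(deg1.values(), default=0), max(deg2.values(), default=0))

# Toy instance: join value "x" is heavy in r2.
r1 = [("a1", "x"), ("a2", "y")]
r2 = [("x", "c1"), ("x", "c2"), ("x", "c3"), ("y", "c4")]

full = join_size(r1, r2)              # join size on the full instance
without_a1 = join_size(r1[1:], r2)    # same query after removing ("a1", "x")
print(full, without_a1, local_sensitivity(r1, r2))
```

In this toy instance, dropping the one record `("a1", "x")` removes three join results at once, so noise calibrated to a sensitivity of 1 would not suffice; with more tables in the join, these degrees multiply along the join path.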