To analyze the privacy guarantees for personal data in a database that is subject to queries, it is necessary to model the prior knowledge of a potential attacker. Differential privacy considers a worst-case scenario in which the attacker knows almost everything, which in many applications is unrealistic and forces a large utility loss. This paper considers a setting called statistical privacy, in which an adversary knows the distribution from which the database is generated, but not the exact data of all (or sufficiently many of) its entries. We analyze in detail how the entropy of this distribution guarantees privacy for a large class of queries called property queries. Exact formulas are obtained for the privacy parameters, and we analyze how they depend on the probability that an entry fulfills the property under investigation. These formulas turn out to be lengthy, but they can be used for tight numerical approximations of the privacy parameters. Such estimates are necessary for applying privacy-enhancing techniques in practice. For this statistical setting we further investigate the effect of adding noise or applying subsampling, as well as the privacy-utility tradeoff. The dependencies on the parameters are illustrated in detail by a series of plots. Finally, these results are compared to the differential privacy model.