Statistical heterogeneity is a measure of how skewed the samples of a dataset are. A common problem in the study of differential privacy is that using a statistically heterogeneous dataset causes a significant loss of accuracy. Statistical heterogeneity is more likely to arise in federated scenarios, which makes this problem even more pressing. We explore the three most promising measures of statistical heterogeneity and give formulae for their accuracy under differential privacy. We find the optimum privacy parameters via an analytic mechanism that incorporates root-finding methods. We validate the main theorems and related hypotheses experimentally, and test the robustness of the analytic mechanism to different heterogeneity levels. In a distributed setting, the analytic mechanism is more accurate than every combination involving the classic mechanism and/or the centralized setting. None of the measures of statistical heterogeneity loses significant accuracy when a heterogeneous sample is used.
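To illustrate how a root-finding step can calibrate a privacy parameter, the sketch below assumes that the "analytic mechanism" is the analytic Gaussian mechanism of Balle and Wang and that the "classic mechanism" is the standard Gaussian mechanism; the abstract names neither, so both are assumptions. It solves for the smallest noise scale sigma that satisfies the exact (epsilon, delta) condition by bisection. The function names, the choice of bisection, and the example parameters are illustrative and are not taken from the paper.

```python
import math


def std_normal_cdf(x):
    """Standard normal CDF, written with erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def gaussian_delta(sigma, epsilon, sensitivity=1.0):
    """Exact delta achieved by Gaussian noise of scale sigma at a given epsilon.

    Assumption: this is the analytic Gaussian mechanism condition
    Phi(D/(2s) - e*s/D) - exp(e) * Phi(-D/(2s) - e*s/D),
    with D the L2 sensitivity, s the noise scale, and e the epsilon.
    """
    a = sensitivity / (2.0 * sigma)
    b = epsilon * sigma / sensitivity
    return std_normal_cdf(a - b) - math.exp(epsilon) * std_normal_cdf(-a - b)


def calibrate_sigma(epsilon, delta, sensitivity=1.0, iterations=200):
    """Bisection for the smallest sigma with gaussian_delta(sigma) <= delta.

    gaussian_delta decreases as sigma grows, so the feasible set is an
    interval [sigma*, infinity) and bisection converges to sigma*.
    """
    lo = 1e-6 * sensitivity   # tiny noise: delta is close to 1, infeasible
    hi = sensitivity
    while gaussian_delta(hi, epsilon, sensitivity) > delta:
        hi *= 2.0             # grow the bracket until hi is feasible
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if gaussian_delta(mid, epsilon, sensitivity) > delta:
            lo = mid          # too little noise, move the lower end up
        else:
            hi = mid          # feasible, try less noise
    return hi


def classic_sigma(epsilon, delta, sensitivity=1.0):
    """Classic Gaussian mechanism calibration (a sufficient condition only)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


if __name__ == "__main__":
    eps, dlt = 1.0, 1e-5      # hypothetical privacy budget
    print("analytic sigma:", calibrate_sigma(eps, dlt))
    print("classic sigma: ", classic_sigma(eps, dlt))
```

Because the classic calibration is only a sufficient condition, the root-finding version returns a smaller noise scale for the same privacy budget, which is one plausible source of the accuracy gain described above.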