Statistical heterogeneity measures how skewed the samples of a dataset are. A common problem in differential privacy is that using a statistically heterogeneous dataset causes a significant loss of accuracy. Statistical heterogeneity is more likely to arise in federated settings, which makes this problem even more pressing there. We examine the three most promising measures of statistical heterogeneity and derive formulae for their accuracy under differential privacy. We find the optimal privacy parameters with an analytic mechanism that relies on root-finding methods. We validate the main theorems and related hypotheses experimentally and test the robustness of the analytic mechanism across heterogeneity levels. The analytic mechanism in a distributed setting is more accurate than every combination involving the classic mechanism and/or the centralized setting. None of the measures of statistical heterogeneity loses significant accuracy when a heterogeneous sample is used.
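The abstract does not say which analytic mechanism is used or how its root finding is set up. As an illustration only, the sketch below assumes the analytic Gaussian mechanism of Balle and Wang (2018), where the smallest noise scale satisfying (ε, δ)-differential privacy is found by solving a one-dimensional equation numerically. The function names, the bisection solver, and the example parameters are hypothetical choices made for this sketch, not taken from the paper.

```python
import math


def std_normal_cdf(x):
    """Standard normal CDF, written with erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def gaussian_delta(sigma, epsilon, sensitivity=1.0):
    """Smallest delta for which Gaussian noise of scale sigma is
    (epsilon, delta)-DP, per the analytic characterisation assumed here."""
    a = sensitivity / (2.0 * sigma)
    b = epsilon * sigma / sensitivity
    return std_normal_cdf(a - b) - math.exp(epsilon) * std_normal_cdf(-a - b)


def calibrate_sigma(epsilon, delta, sensitivity=1.0, rel_tol=1e-10):
    """Find (approximately) the smallest sigma with gaussian_delta <= delta.

    gaussian_delta decreases as sigma grows, so a plain bisection suffices.
    """
    lo, hi = 0.0, sensitivity
    # Grow the upper bracket until it satisfies the privacy constraint.
    while gaussian_delta(hi, epsilon, sensitivity) > delta:
        lo, hi = hi, 2.0 * hi
    # Shrink the bracket; hi always satisfies the constraint, lo never does.
    while hi - lo > rel_tol * hi:
        mid = 0.5 * (lo + hi)
        if gaussian_delta(mid, epsilon, sensitivity) > delta:
            lo = mid
        else:
            hi = mid
    return hi


def classic_sigma(epsilon, delta, sensitivity=1.0):
    """Classic Gaussian mechanism calibration (valid for epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


if __name__ == "__main__":
    eps, dlt = 0.5, 1e-5  # illustrative privacy parameters
    print("analytic sigma:", calibrate_sigma(eps, dlt))
    print("classic sigma: ", classic_sigma(eps, dlt))
```

Under these assumptions the calibrated noise scale is never larger than the classic one for the same (ε, δ), which is one plausible reason an analytic mechanism could yield higher accuracy than the classic mechanism; the paper's own construction may differ.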