Privacy protection of users' entire contribution of samples is important in distributed systems. The most effective existing approach is the two-stage scheme, which first finds a small interval and then obtains a refined estimate by clipping samples to that interval. However, the clipping operation induces bias, which becomes severe when the sample distribution is heavy-tailed. In addition, users with large local sample sizes can greatly increase the sensitivity, so the method is not suitable for imbalanced users. Motivated by these challenges, we propose a Huber loss minimization approach to mean estimation under user-level differential privacy. The connecting points of the Huber loss can be adaptively adjusted to handle imbalanced users. Moreover, the approach avoids the clipping operation, thus significantly reducing the bias compared with the two-stage approach. We provide a theoretical analysis of our approach, which gives the noise strength needed for privacy protection as well as a bound on the mean squared error. The results show that the new method is much less sensitive to the imbalance of user-wise sample sizes and to the tail of the sample distribution. Finally, we perform numerical experiments to validate our theoretical analysis.
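To illustrate the core estimator the abstract describes, the sketch below computes a mean by minimizing the Huber loss via iteratively reweighted least squares. This is a minimal non-private sketch: the connecting point `delta` is a hypothetical fixed value (the paper adapts it per user), and no privacy noise is added. Because the Huber loss grows only linearly outside `[-delta, delta]`, the estimate downweights outliers instead of clipping them, which is the bias-reduction mechanism the abstract contrasts with the two-stage scheme.

```python
def huber_mean(samples, delta, iters=50):
    """Mean estimate minimizing sum_i huber(x_i - mu, delta).

    delta is the Huber connecting point between the quadratic and
    linear regions (fixed here for illustration; the paper's method
    adapts it to user-wise sample sizes).
    """
    # Start from the median, a robust initial guess.
    mu = sorted(samples)[len(samples) // 2]
    for _ in range(iters):
        # IRLS weights: 1 inside the quadratic region,
        # delta/|residual| in the linear (outlier) region.
        w = [1.0 if abs(x - mu) <= delta else delta / abs(x - mu)
             for x in samples]
        mu = sum(wi * xi for wi, xi in zip(w, samples)) / sum(w)
    return mu


# A single distant outlier barely moves the Huber estimate,
# while it drags the plain average far from the bulk of the data.
data = [1.0] * 9 + [1000.0]
robust = huber_mean(data, delta=1.0)
naive = sum(data) / len(data)  # 100.9
```

Under user-level differential privacy, the paper's full method would additionally calibrate noise to the sensitivity of this minimizer, which the adaptive connecting points keep small even for imbalanced users.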