We study the problem of histogram estimation under user-level differential privacy, where the goal is to preserve the privacy of all entries of any single user. We consider the heterogeneous scenario in which the quantity of data can differ from user to user. In this scenario, the amount of noise injected into the histogram to obtain differential privacy is proportional to the maximum user contribution, which can be inflated by a few outliers. One way to circumvent this is to bound (or limit) the contribution of each user to the histogram. However, if users are limited to small contributions, a significant amount of data will be discarded. In this work, we propose algorithms to choose the best user contribution bound for histogram estimation under both bounded and unbounded domain settings. When the size of the domain is bounded, we propose a user contribution bounding strategy that almost achieves a two-approximation with respect to the best contribution bound in hindsight. For unbounded domain histogram estimation, we propose an algorithm that achieves a logarithmic approximation with respect to the best contribution bound in hindsight. This result holds without any distributional assumptions on the data. Experiments on both real and synthetic datasets verify our theoretical findings and demonstrate the effectiveness of our algorithms. We also show that the clipping bias introduced by bounding user contributions may be reduced under mild distributional assumptions, which can be of independent interest.
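The trade-off described in the abstract, between the noise that grows with the contribution bound and the data discarded by clipping, can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper: it is a generic bounded-domain Laplace-mechanism baseline under several assumptions of ours. We assume add/remove-one-user neighboring datasets (so the L1 sensitivity equals the bound), clipping that keeps each user's first `bound` items, and a non-private "hindsight" selection that simply minimizes clipping bias plus expected noise. All function names are hypothetical.

```python
import random


def clipped_histogram(user_data, bound, domain_size):
    """Sum per-user histograms after keeping at most `bound` items per user.

    Assumption: items are integers in range(domain_size), and clipping keeps
    the first `bound` items of each user (other clipping rules are possible).
    """
    hist = [0] * domain_size
    for items in user_data:
        for x in items[:bound]:
            hist[x] += 1
    return hist


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_histogram(user_data, bound, domain_size, epsilon, rng):
    """Clipped histogram plus Laplace noise of scale bound / epsilon.

    Assumption: add/remove-one-user neighbors, so one user changes the
    clipped histogram by at most `bound` in L1 norm.
    """
    hist = clipped_histogram(user_data, bound, domain_size)
    scale = bound / epsilon
    return [count + laplace(scale, rng) for count in hist]


def clipping_bias(user_data, bound):
    """L1 distance between the true and clipped histograms.

    Clipping only removes items, so this equals the number discarded.
    """
    return sum(max(len(items) - bound, 0) for items in user_data)


def expected_noise_l1(bound, domain_size, epsilon):
    """Expected L1 norm of the noise: E|Laplace(b)| = b for each bin."""
    return domain_size * bound / epsilon


def best_bound_in_hindsight(user_data, candidates, domain_size, epsilon):
    """Non-private reference: bound minimizing bias + expected noise.

    This inspects the raw data directly, so it is for illustration only.
    """
    return min(
        candidates,
        key=lambda b: clipping_bias(user_data, b)
        + expected_noise_l1(b, domain_size, epsilon),
    )


if __name__ == "__main__":
    rng = random.Random(0)
    # 20 ordinary users with 5 items each, plus one outlier with 500 items.
    users = [[rng.randrange(4) for _ in range(5)] for _ in range(20)]
    users.append([rng.randrange(4) for _ in range(500)])
    bound = best_bound_in_hindsight(users, range(1, 501), 4, 1.0)
    print("hindsight bound:", bound)
    print("noisy histogram:", private_histogram(users, bound, 4, 1.0, rng))
```

In this toy setup, setting the bound to the outlier's 500 items would remove all clipping bias but scale the noise in every bin by 500, whereas a small bound discards most of the outlier's data; the hindsight choice balances the two. Selecting the bound privately, which is the problem the abstract addresses, requires more care than this sketch shows.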