We consider the problem of privately estimating a parameter $\mathbb{E}[h(X_1,\dots,X_k)]$, where $X_1, X_2, \dots, X_k$ are i.i.d. data from some distribution and $h$ is a permutation-invariant function. Without privacy constraints, the standard estimators are U-statistics. They arise in a wide range of problems, including nonparametric signed rank tests, symmetry testing, uniformity testing, and subgraph counts in random networks, and can be shown to be minimum variance unbiased estimators under mild conditions. Despite the recent outpouring of interest in private mean estimation, privatizing U-statistics has received little attention. While existing private mean estimation algorithms can be applied to obtain confidence intervals, we show that they can lead to suboptimal private error, e.g., constant-factor inflation in the leading term, or even an error of $\Theta(1/n)$ rather than $O(1/n^2)$ in degenerate settings. To remedy this, we propose a new thresholding-based approach that uses \emph{local H\'ajek projections} to reweight different subsets of the data. This leads to nearly optimal private error for non-degenerate U-statistics and a strong indication of near-optimality for degenerate U-statistics.
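For readers unfamiliar with U-statistics, the non-private baseline mentioned above can be sketched as follows. This is only an illustration of the classical estimator, which averages a permutation-invariant kernel $h$ over all size-$k$ subsets of $n$ samples; it is not the private algorithm proposed here. The names `u_statistic` and `variance_kernel` are hypothetical and chosen for this example.

```python
from itertools import combinations
from math import comb


def u_statistic(data, h, k):
    """Non-private U-statistic: average of h over all size-k subsets of data.

    Assumes h is permutation-invariant in its k arguments, so each
    unordered subset only needs to be evaluated once.
    """
    n = len(data)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(data)")
    total = sum(h(*subset) for subset in combinations(data, k))
    return total / comb(n, k)


def variance_kernel(x, y):
    """Illustrative kernel with k = 2: E[h(X1, X2)] equals Var(X)."""
    return 0.5 * (x - y) ** 2


# Small usage example on a toy sample (illustrative data only).
sample = [1.0, 2.0, 3.0, 4.0]
estimate = u_statistic(sample, variance_kernel, 2)
```

With the variance kernel, the U-statistic coincides with the usual unbiased sample variance, a standard example of a non-degenerate U-statistic. Enumerating all $\binom{n}{k}$ subsets is practical only for small $n$ and $k$.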