We study differentially private (DP) mean estimation in the case where each person holds multiple samples. In this setting, commonly referred to as the "user-level" setting, DP requires the usual notion of distributional stability even when all of a person's datapoints can be modified. Informally, if $n$ people each have $m$ samples from an unknown $d$-dimensional distribution with bounded $k$-th moments, we show that \[n = \tilde \Theta\left(\frac{d}{\alpha^2 m} + \frac{d }{ \alpha m^{1/2} \varepsilon} + \frac{d}{\alpha^{k/(k-1)} m \varepsilon} + \frac{d}{\varepsilon}\right)\] people are necessary and sufficient to estimate the mean up to distance $\alpha$ in $\ell_2$-norm under $\varepsilon$-differential privacy (and its common relaxations). In the multivariate setting, we give computationally efficient algorithms under approximate DP (with slightly degraded sample complexity) and computationally inefficient algorithms under pure DP, and our nearly matching lower bounds hold for the most permissive case of approximate DP. Our computationally efficient estimators are based on the well-known noisy-clipped-mean approach, but the analysis for our setting requires new bounds on the tails of sums of independent, vector-valued random variables with bounded moments, and a new argument for bounding the bias introduced by clipping.
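For readers unfamiliar with the noisy-clipped-mean approach, the following is a minimal Python sketch of a generic user-level variant, not the estimator analyzed in this work. It averages each person's $m$ samples, clips each per-person average to an $\ell_2$ ball, averages across the $n$ people, and adds Gaussian noise calibrated with the classical Gaussian mechanism for approximate DP. The function names, the externally supplied `center` and `radius`, and the restriction to $0 < \varepsilon < 1$ are all assumptions made for illustration; in particular, choosing the clipping center and radius privately is not addressed here.

```python
import numpy as np


def clip_to_ball(v, center, radius):
    """Project v onto the l2 ball of the given radius around center."""
    diff = v - center
    norm = np.linalg.norm(diff)
    if norm <= radius:
        return v
    return center + diff * (radius / norm)


def noisy_clipped_mean(samples, center, radius, epsilon, delta, rng):
    """Illustrative user-level noisy clipped mean (hypothetical helper).

    samples: array of shape (n, m, d), i.e. n people with m samples each.
    center, radius: clipping ball, assumed to be given (not chosen privately).
    epsilon, delta: approximate-DP parameters; the classical Gaussian
        mechanism calibration used below assumes 0 < epsilon < 1.
    rng: a numpy.random.Generator.
    """
    if not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("this sketch assumes 0 < epsilon < 1 and 0 < delta < 1")

    n, m, d = samples.shape

    # Step 1: collapse each person's m samples into one average.
    user_means = samples.mean(axis=1)  # shape (n, d)

    # Step 2: clip every per-person average to the ball.
    clipped = np.stack([clip_to_ball(u, center, radius) for u in user_means])

    # Step 3: replacing all of one person's data moves one clipped point by
    # at most 2 * radius, so the average over n people moves by at most
    # 2 * radius / n in l2 norm.
    sensitivity = 2.0 * radius / n

    # Step 4: classical Gaussian mechanism noise scale.
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

    return clipped.mean(axis=0) + rng.normal(0.0, sigma, size=d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(loc=0.2, scale=1.0, size=(1000, 16, 5))
    print(noisy_clipped_mean(data, np.zeros(5), 1.0, 0.5, 1e-6, rng))
```

The sketch makes the trade-off visible: a smaller `radius` reduces the noise scale but clips more per-person averages, which biases the estimate. Controlling that bias for distributions with only bounded $k$-th moments is the part that needs the new analysis described above.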