Differentially private mean estimation is an important building block in privacy-preserving algorithms for data analysis and machine learning. Though the trade-off between privacy and utility is well understood in the worst case, many datasets exhibit structure that could be exploited to yield better algorithms. In this paper we present $\textit{Private Limit Adapted Noise}$ (PLAN), a family of differentially private algorithms for mean estimation in the setting where inputs are independently sampled from a distribution $\mathcal{D}$ over $\mathbf{R}^d$, with coordinate-wise standard deviations $\boldsymbol{\sigma} \in \mathbf{R}^d$. Similar to mean estimation under Mahalanobis distance, PLAN tailors the shape of the noise to the shape of the data, but unlike previous algorithms the privacy budget is spent non-uniformly over the coordinates. Under a concentration assumption on $\mathcal{D}$, we show how to exploit skew in the vector $\boldsymbol{\sigma}$, obtaining a (zero-concentrated) differentially private mean estimate with $\ell_2$ error proportional to $\|\boldsymbol{\sigma}\|_1$. Previous work has either not taken $\boldsymbol{\sigma}$ into account or measured error in Mahalanobis distance; in both cases the resulting $\ell_2$ error is proportional to $\sqrt{d}\|\boldsymbol{\sigma}\|_2$, which can be up to a factor $\sqrt{d}$ larger. To verify the effectiveness of PLAN, we empirically evaluate accuracy on both synthetic and real-world data.
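To make the gap between $\|\boldsymbol{\sigma}\|_1$ and $\sqrt{d}\|\boldsymbol{\sigma}\|_2$ concrete, the following is a minimal sketch of the underlying noise-allocation calculation; it is not the PLAN algorithm itself. It assumes that coordinate $i$ has a known per-coordinate sensitivity `deltas[i]` proportional to $\sigma_i$ (for example after clipping), and that independent Gaussian noise with standard deviation `scales[i]` is added to each coordinate, which costs $\sum_i \Delta_i^2 / (2 s_i^2)$ in zero-concentrated differential privacy. The function names and the example values are hypothetical and chosen only for illustration.

```python
import math

def zcdp_cost(deltas, scales):
    """zCDP parameter rho of adding N(0, scales[i]^2) noise to coordinate i,
    assuming coordinate i has sensitivity deltas[i]."""
    return sum((dl / s) ** 2 for dl, s in zip(deltas, scales)) / 2

def rmse(scales):
    """Root of the expected squared l2 norm of the added noise."""
    return math.sqrt(sum(s * s for s in scales))

def uniform_scales(deltas, rho):
    """Noise proportional to the sensitivity: the budget is split evenly,
    rho / d per coordinate (the Mahalanobis-style allocation)."""
    d = len(deltas)
    return [dl * math.sqrt(d / (2 * rho)) for dl in deltas]

def skew_aware_scales(deltas, rho):
    """Noise variance proportional to the sensitivity: this minimises the
    total variance subject to the same rho. Requires all deltas > 0."""
    l1 = sum(deltas)
    return [math.sqrt(dl * l1 / (2 * rho)) for dl in deltas]

# Hypothetical skewed example: one wide coordinate and 99 narrow ones.
deltas = [1.0] + [0.01] * 99
rho = 0.5

uniform = uniform_scales(deltas, rho)
skewed = skew_aware_scales(deltas, rho)

print("uniform budget: rho =", zcdp_cost(deltas, uniform), "rmse =", rmse(uniform))
print("skew-aware:     rho =", zcdp_cost(deltas, skewed), "rmse =", rmse(skewed))
```

Under these assumptions both allocations spend the same budget $\rho$, but the uniform split has noise magnitude $\sqrt{d}\,\|\Delta\|_2/\sqrt{2\rho}$ while the skew-aware split has $\|\Delta\|_1/\sqrt{2\rho}$. By the Cauchy-Schwarz inequality the latter is never larger, and the two coincide when all sensitivities are equal.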