Many methods in differentially private model training rely on computing the similarity between a query point (such as public or synthetic data) and private data. We abstract out this common subroutine and study the following fundamental algorithmic problem: Given a similarity function $f$ and a large high-dimensional private dataset $X \subset \mathbb{R}^d$, output a differentially private (DP) data structure which approximates $\sum_{x \in X} f(x,y)$ for any query $y$. We consider the cases where $f$ is a kernel function, such as $f(x,y) = e^{-\|x-y\|_2^2/\sigma^2}$ (also known as DP kernel density estimation), or a distance function such as $f(x,y) = \|x-y\|_2$, among others. Our theoretical results improve upon prior work and give better privacy-utility trade-offs as well as faster query times for a wide range of kernels and distance functions. The unifying approach behind our results is leveraging "low-dimensional structures" present in the specific functions $f$ that we study, using tools such as provable dimensionality reduction, approximation theory, and one-dimensional decomposition of the functions. Our algorithms empirically exhibit improved query times and accuracy over the prior state of the art. We also present an application to DP classification. Our experiments demonstrate that the simple methodology of classifying based on average similarity is orders of magnitude faster than prior DP-SGD-based approaches for comparable accuracy.
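To make the problem statement concrete, the following is a minimal sketch of one standard baseline for the Gaussian-kernel case. It is not the algorithm developed in this work. It assumes random Fourier features to approximate $f(x,y) = e^{-\|x-y\|_2^2/\sigma^2}$, the Laplace mechanism applied to the summed feature vector, and add/remove-one-record neighboring datasets. The names `DPGaussianKDE`, `exact_kernel_sum`, `num_features`, `feature_seed`, and `noise_rng` are hypothetical and chosen only for illustration.

```python
import math
import random


def exact_kernel_sum(X, y, sigma):
    """Non-private reference: sum over x in X of exp(-||x - y||_2^2 / sigma^2)."""
    return sum(
        math.exp(-sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / sigma ** 2)
        for x in X
    )


class DPGaussianKDE:
    """Illustrative epsilon-DP structure for Gaussian kernel sums (assumed baseline)."""

    def __init__(self, X, sigma, epsilon, num_features=256,
                 feature_seed=0, noise_rng=None):
        d = len(X[0])
        self.D = num_features

        # Random Fourier features are drawn independently of the private data,
        # so W and b can be treated as public. With w ~ N(0, (2 / sigma^2) I),
        # E[cos(w . (x - y))] = exp(-||x - y||^2 / sigma^2).
        feat_rng = random.Random(feature_seed)
        w_std = math.sqrt(2.0) / sigma
        self.W = [[feat_rng.gauss(0.0, w_std) for _ in range(d)]
                  for _ in range(num_features)]
        self.b = [feat_rng.uniform(0.0, 2.0 * math.pi)
                  for _ in range(num_features)]

        # The only data-dependent quantity: the sum of feature vectors over X.
        total = [0.0] * num_features
        for x in X:
            for j, zj in enumerate(self._features(x)):
                total[j] += zj

        # Each feature coordinate has magnitude at most sqrt(2 / D), so adding
        # or removing one record changes the sum by at most sqrt(2 * D) in L1.
        scale = math.sqrt(2.0 * num_features) / epsilon
        if noise_rng is None:
            noise_rng = random.SystemRandom()
        # Note: floating-point Laplace sampling like this is not hardened
        # for production DP use; it is only meant to show the mechanism.
        self.noisy_sum = [t + self._laplace(noise_rng, scale) for t in total]

    @staticmethod
    def _laplace(rng, scale):
        # The difference of two i.i.d. exponentials with mean `scale`
        # is distributed as Laplace(0, scale).
        return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

    def _features(self, v):
        c = math.sqrt(2.0 / self.D)
        return [c * math.cos(sum(wi * vi for wi, vi in zip(w, v)) + bj)
                for w, bj in zip(self.W, self.b)]

    def query(self, y):
        """Estimate sum_x f(x, y); post-processing of noisy_sum, so no extra privacy cost."""
        return sum(zj * sj for zj, sj in zip(self._features(y), self.noisy_sum))
```

Because the noise is added once to the released vector, any number of queries can be answered without further privacy loss, and each query costs time proportional to `num_features * d` regardless of the dataset size. Under the same assumptions, a classifier in the spirit of the application mentioned above could build one such structure per class and predict the class whose estimated average similarity to the query is largest.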