We introduce a refined differentially private (DP) data structure for kernel density estimation (KDE), offering not only an improved privacy-utility tradeoff but also better efficiency than prior results. Specifically, we study the following problem (DP KDE): given a similarity function $f$ and a private dataset $X \subset \mathbb{R}^d$, preprocess $X$ so that for any query $y\in\mathbb{R}^d$, we can approximate $\sum_{x \in X} f(x, y)$ in a differentially private fashion. For $f(x,y) =\| x - y \|_1$, the best previous algorithm is the node-contaminated balanced binary tree of [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. Their algorithm requires $O(nd)$ space and preprocessing time, where $n=|X|$; each query takes $O(d \log n)$ time and returns a $(1+\alpha)$-approximation with additive error $\epsilon^{-1} \alpha^{-0.5} d^{1.5} R \log^{1.5} n$. In this paper, we improve on this previous result [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024] in three aspects:

- We reduce the query time by a factor of $\alpha^{-1} \log n$.
- We improve the approximation ratio from $1+\alpha$ to $1$, i.e., the estimate is exact up to the additive error.
- We reduce the error dependence by a factor of $\alpha^{-0.5}$.

From a technical perspective, our method of constructing the search tree differs from that of prior work [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. There, the answer to each query is split into $\alpha^{-1} \log n$ numbers, each derived from the sum of $\log n$ interval-tree counts. In contrast, we construct the tree differently, splitting the answer into $\log n$ numbers, each a careful combination of two distance values, two counting values, and the query $y$ itself. We believe our tree structure may be of independent interest.
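To illustrate why a query can be answered from counting values, distance (sum) values, and $y$ itself, the following minimal sketch computes $\sum_{x \in X} |x - y|$ exactly in one dimension using sorted prefix sums. This is not the authors' tree construction (which adds calibrated noise at tree nodes for privacy); it is only a non-private sketch of the underlying identity: points below $y$ contribute $y - x$ and points above contribute $x - y$, so the answer is determined by two counts and two sums.

```python
import bisect

def build(points):
    """Preprocess: sort the 1D dataset and store prefix sums.

    Non-private illustration; a DP version would release noisy
    counts/sums (e.g., via Laplace noise on tree nodes) instead.
    """
    xs = sorted(points)
    prefix = [0.0]
    for x in xs:
        prefix.append(prefix[-1] + x)
    return xs, prefix

def l1_sum(xs, prefix, y):
    """Exact sum of |x - y| from two counts, two sums, and y."""
    n = len(xs)
    k = bisect.bisect_left(xs, y)        # count of points below y
    sum_below = prefix[k]                # sum of points below y
    sum_above = prefix[n] - prefix[k]    # sum of points at/above y
    # below-y points contribute y - x; at/above-y points contribute x - y
    return (k * y - sum_below) + (sum_above - (n - k) * y)
```

For example, with $X = \{1, 2, 4\}$ and $y = 3$, the sum is $2 + 1 + 1 = 4$. In the high-dimensional $\ell_1$ case the coordinates decompose, so this 1D identity is applied per dimension.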