We introduce a refined differentially private (DP) data structure for kernel density estimation (KDE) that offers both a better privacy-utility tradeoff and better efficiency than prior results. Specifically, we study the following mathematical problem (DP KDE): given a similarity function $f$ and a private dataset $X \subset \mathbb{R}^d$, preprocess $X$ so that for any query $y\in\mathbb{R}^d$ we can approximate $\sum_{x \in X} f(x, y)$ in a differentially private fashion. The best previous algorithm for $f(x,y) =\| x - y \|_1$ is the node-contaminated balanced binary tree of [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. Their algorithm requires $O(nd)$ space and time for preprocessing, where $n=|X|$. For any query point, the query time is $d \log n$, with an error guarantee of a $(1+\alpha)$-approximation and an additive error of $\epsilon^{-1} \alpha^{-0.5} d^{1.5} R \log^{1.5} n$. In this paper, we improve on the best previous result [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024] in three aspects:

- We reduce the query time by a factor of $\alpha^{-1} \log n$.
- We improve the approximation ratio from $(1+\alpha)$ to 1.
- We reduce the error dependence by a factor of $\alpha^{-0.5}$.

From a technical perspective, our method of constructing the search tree differs from that of previous work [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. In prior work, the answer to each query is split into $\alpha^{-1} \log n$ numbers, each derived by summing $\log n$ values from interval-tree counts. In contrast, we construct the tree differently, splitting the answer into $\log n$ numbers, each of which is a combination of two distance values, two counting values, and $y$ itself. We believe our tree structure may be of independent interest.
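To make the query problem concrete, the following is a minimal non-private sketch of computing $\sum_{x \in X} \| x - y \|_1$ exactly with per-coordinate sorted arrays and prefix sums. It is not the tree construction described above and adds no noise, so it provides no privacy guarantee. The class name `L1DistanceSum` and its methods are hypothetical, chosen only for illustration. It does show, in the simplest setting, how each coordinate's contribution can be written as a combination of two partial sums, two counts, and the query value itself.

```python
import bisect
from itertools import accumulate


class L1DistanceSum:
    """Exact, NON-private sum of L1 distances (illustrative baseline only)."""

    def __init__(self, points):
        # points: non-empty list of equal-length coordinate sequences.
        self.n = len(points)
        self.d = len(points[0])
        self.sorted_cols = []
        self.prefix = []
        for j in range(self.d):
            col = sorted(p[j] for p in points)
            self.sorted_cols.append(col)
            # prefix[k] = sum of the k smallest values in coordinate j.
            self.prefix.append([0.0] + list(accumulate(col)))

    def query(self, y):
        """Return sum over stored points x of ||x - y||_1."""
        total = 0.0
        for j in range(self.d):
            col, pre = self.sorted_cols[j], self.prefix[j]
            k = bisect.bisect_right(col, y[j])  # count of values <= y[j]
            left_sum = pre[k]                   # sum of values <= y[j]
            right_sum = pre[-1] - pre[k]        # sum of values > y[j]
            # Values at or below y[j] contribute y[j] - x; the rest x - y[j].
            total += (k * y[j] - left_sum) + (right_sum - (self.n - k) * y[j])
        return total


if __name__ == "__main__":
    ds = L1DistanceSum([(0, 0), (1, 2), (3, 1)])
    print(ds.query((1, 1)))
```

In this sketch, preprocessing costs $O(nd \log n)$ time because of the sorting step, and each query costs $O(d \log n)$ time for one binary search per coordinate. A differentially private variant would need to add calibrated noise to the stored counts and sums, which this baseline deliberately leaves out.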