We propose $\mathtt{PrivHP}$, a lightweight synthetic data generator with \textit{differential privacy} guarantees. $\mathtt{PrivHP}$ uses a novel hierarchical decomposition that approximates the input's cumulative distribution function (CDF) in bounded memory. It balances hierarchy depth, noise addition, and the pruning of low-frequency subdomains, while preserving high-frequency ones. Private sketches estimate subdomain frequencies efficiently without full access to the data. A key feature is the pruning parameter $k$, which controls the trade-off between space and utility. We define the skew measure $\mathtt{tail}_k$, which captures all but the top $k$ subdomain frequencies. Given a dataset $\mathcal{X}$ of size $n = |\mathcal{X}|$, $\mathtt{PrivHP}$ uses $M=\mathcal{O}(k\log^2 |\mathcal{X}|)$ space and, for input domain $\Omega = [0,1]$, ensures $\varepsilon$-differential privacy. It yields a generator whose expected Wasserstein distance from the empirical distribution is \[ \mathcal{O}\left(\frac{\log^2 M}{\varepsilon n} + \frac{\|\mathtt{tail}_k(\mathcal{X})\|_1}{M n}\right). \] This parameterized trade-off offers a level of flexibility unavailable in prior work. We also provide interpretable utility bounds that account for hierarchy depth, privacy noise, pruning, and frequency estimation errors.
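To make the role of $k$ in the bound concrete, the following Python sketch computes a simplified, single-level version of $\|\mathtt{tail}_k\|_1$ and evaluates the two terms inside the $\mathcal{O}(\cdot)$ expression. It is an illustration under stated assumptions, not the $\mathtt{PrivHP}$ algorithm: the function names `tail_k_l1` and `bound_terms` are hypothetical, subdomains are taken to be the equal-width dyadic cells of $[0,1]$ at one fixed depth, no noise or sketching is applied, hidden constants are ignored, and the logarithm is taken to be natural.

```python
import math
from collections import Counter


def tail_k_l1(points, k, depth):
    """L1 mass of all but the k most frequent dyadic cells of [0, 1].

    Assumption: a single level of 2**depth equal-width cells stands in
    for the subdomains of the hierarchy.
    """
    cells = 2 ** depth
    # Map each point to its cell index; x == 1.0 is clamped into the last cell.
    counts = Counter(min(int(x * cells), cells - 1) for x in points)
    freqs = sorted(counts.values(), reverse=True)
    # Drop the top-k frequencies and sum what remains.
    return sum(freqs[k:])


def bound_terms(n, M, eps, tail_l1):
    """Return the two terms of the stated bound, with constants dropped.

    Assumption: natural logarithm; the O(.) notation leaves the base open.
    """
    noise_term = math.log(M) ** 2 / (eps * n)
    pruning_term = tail_l1 / (M * n)
    return noise_term, pruning_term


# Toy data on [0, 1]; at depth 2 the four cell counts are 3, 2, 1, 2.
points = [0.05, 0.06, 0.07, 0.30, 0.31, 0.55, 0.80, 0.95]
tail = tail_k_l1(points, k=2, depth=2)  # 3: the two smallest cells, 2 + 1
noise, pruning = bound_terms(n=len(points), M=16, eps=1.0, tail_l1=tail)
print(tail, noise, pruning)
```

On skewed data most of the mass sits in a few cells, so the tail is small and the pruning term contributes little; on near-uniform data the tail stays large unless $k$ (and hence $M$) grows.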