Private Synthetic Data Generation in Small Memory

Protecting sensitive information on data streams is a critical challenge for modern systems. Current approaches to privacy in data streams follow two strategies. The first transforms the stream into a private sequence, enabling the use of non-private analyses but incurring high memory costs. The second uses compact data structures to create private summaries but restricts flexibility to predefined queries. To address these limitations, we propose $\textsf{PrivHP}$, a lightweight synthetic data generator that ensures differential privacy while being resource-efficient. $\textsf{PrivHP}$ generates private synthetic data that preserves the input stream's distribution, allowing flexible downstream analyses without additional privacy costs. It leverages a hierarchical decomposition of the domain, pruning low-frequency subdomains while preserving high-frequency ones in a privacy-preserving manner. To achieve memory efficiency in streaming contexts, $\textsf{PrivHP}$ uses private sketches to estimate subdomain frequencies without accessing the full dataset. $\textsf{PrivHP}$ is parameterized by a privacy budget $\varepsilon$, a pruning parameter $k$, and a sketch width $w$. It processes a dataset of size $n$ in $\mathcal{O}((w+k)\log (\varepsilon n))$ space with $\mathcal{O}(\log (\varepsilon n))$ update time, and outputs a private synthetic data generator in $\mathcal{O}(k\log k\log (\varepsilon n))$ time. Prior methods require $\Omega(n)$ space and construction time. We measure utility by the expected 1-Wasserstein distance between the sampler and the empirical distribution. We show that the additional utility cost relative to state-of-the-art methods is inversely proportional to $k$ and $w$. This represents the first meaningful trade-off between performance and utility for private synthetic data generation.
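To make the hierarchical decomposition and pruning idea concrete, the following is a minimal illustrative sketch, not the $\textsf{PrivHP}$ algorithm itself. It rests on several assumptions that are not in the abstract: the domain is $[0, 1)$ and is split dyadically, the privacy budget $\varepsilon$ is divided evenly across levels, and subdomain counts are computed exactly from an in-memory list with Laplace noise added, rather than estimated from private sketches. As a result the sketch width $w$ does not appear and the stated space bounds are not reproduced. The function names (`build_pruned_tree`, `sample`, `wasserstein1`) are hypothetical.

```python
import random

def laplace(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def build_pruned_tree(data, depth, k, epsilon, rng):
    """Noisy counts for dyadic subintervals of [0, 1).

    At each level only the k nodes with the largest noisy counts are
    expanded further; the remaining nodes are pruned (kept as leaves).
    """
    scale = depth / epsilon      # assumption: budget split evenly over levels
    counts = {}                  # (level, index) -> clamped noisy count
    frontier = [0]               # nodes to expand; the root is level 0
    for level in range(1, depth + 1):
        width = 2.0 ** -level
        children = [2 * i + b for i in frontier for b in (0, 1)]
        for c in children:
            # Exact count stands in for a private sketch estimate.
            exact = sum(1 for x in data if c * width <= x < (c + 1) * width)
            counts[(level, c)] = max(0.0, exact + laplace(scale, rng))
        frontier = sorted(children, key=lambda c: counts[(level, c)],
                          reverse=True)[:k]
    return counts

def sample(counts, depth, rng):
    """Draw one synthetic point by walking down the pruned tree."""
    level, idx = 0, 0
    while level < depth:
        kids = [2 * idx, 2 * idx + 1]
        if (level + 1, kids[0]) not in counts:
            break                # pruned node: stop and sample uniformly in it
        weights = [counts[(level + 1, c)] for c in kids]
        if sum(weights) <= 0:
            break                # no usable signal below this node
        idx = rng.choices(kids, weights=weights)[0]
        level += 1
    return (idx + rng.random()) * 2.0 ** -level

def wasserstein1(xs, ys):
    """1-Wasserstein distance between two equal-size 1-D samples."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Usage: skewed data on [0, 1), then synthetic data of the same size.
rng = random.Random(0)
data = [rng.betavariate(2, 8) for _ in range(2000)]
counts = build_pruned_tree(data, depth=6, k=4, epsilon=1.0, rng=rng)
synthetic = [sample(counts, 6, rng) for _ in range(len(data))]
print("W1(data, synthetic) =", wasserstein1(data, synthetic))
```

In this toy version at most $2k$ noisy counts are stored per level, which mirrors the role of the pruning parameter $k$: a larger $k$ keeps more of the domain at fine resolution, at the cost of more stored counts.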
