Differentially Private Clustering in Data Streams

The streaming model is an abstraction of computation over massive data streams and a popular way of handling large-scale modern data analysis. In this model, data points arrive one after another. A streaming algorithm is allowed only one pass over the stream, and the goal is to perform some analysis during the stream while using as little space as possible. Clustering problems (such as $k$-means and $k$-median) are fundamental unsupervised machine learning primitives, and streaming clustering algorithms have been studied extensively. However, as data privacy has become a central concern in many real-world applications, non-private clustering algorithms are not applicable in many scenarios. In this work, we provide the first differentially private streaming algorithms for $k$-means and $k$-median clustering of $d$-dimensional Euclidean data points over a stream of length at most $T$. Our algorithms use $poly(k,d,\log(T))$ space and achieve a constant multiplicative error and a $poly(k,d,\log(T))$ additive error. In particular, we present a differentially private streaming clustering framework that requires only an offline DP coreset or clustering algorithm as a black box. By plugging in existing results from DP clustering (Ghazi, Kumar, Manurangsi 2020; Kaplan, Stemmer 2018), we achieve either (1) a $(1+\gamma)$-multiplicative approximation with $poly(k,d,\log(T))$ additive error using $\tilde{O}_\gamma(poly(k,d,\log(T)))$ space, for any $\gamma>0$, or (2) an $O(1)$-multiplicative approximation with $poly(k,d,\log(T))$ additive error using $\tilde{O}(k^{1.5} \cdot poly(d,\log(T)))$ space. In addition, our algorithmic framework is differentially private under the continual release setting, i.e., the combined outputs of our algorithms over all timestamps are differentially private.
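
For readers unfamiliar with the terminology, the multiplicative and additive error guarantees can be written out using the standard clustering objective. The notation below ($X$, $C$, $z$, $\alpha$, $\beta$, $\mathrm{cost}_z$, $\mathrm{OPT}_k$) is introduced here for illustration only and is an assumption about conventions, not taken from the abstract. For a set of points $X \subset \mathbb{R}^d$ and a set of $k$ centers $C \subset \mathbb{R}^d$, with $z=2$ for $k$-means and $z=1$ for $k$-median:

$$
\mathrm{cost}_z(C, X) = \sum_{x \in X} \min_{c \in C} \lVert x - c \rVert_2^{z},
\qquad
\mathrm{OPT}_k(X) = \min_{C' \subset \mathbb{R}^d,\ |C'| = k} \mathrm{cost}_z(C', X).
$$

Under these conventions, an algorithm has multiplicative error $\alpha$ and additive error $\beta$ if the centers $C$ it outputs satisfy

$$
\mathrm{cost}_z(C, X) \le \alpha \cdot \mathrm{OPT}_k(X) + \beta .
$$

In this reading, result (1) corresponds to $\alpha = 1+\gamma$ and result (2) to $\alpha = O(1)$, with $\beta = poly(k,d,\log(T))$ in both cases.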
