Differentially Private Clustering in Data Streams

The streaming model is an abstraction of computation over massive data streams and a popular approach to large-scale data analysis. In this model, data points arrive one at a time. A streaming algorithm is allowed only one pass over the stream, and the goal is to perform some analysis during the stream while using as little space as possible. Clustering problems (such as $k$-means and $k$-median) are fundamental unsupervised machine learning primitives, and streaming clustering algorithms have been studied extensively. However, because data privacy has become a central concern in many real-world applications, non-private clustering algorithms are not applicable in many scenarios. In this work, we provide the first differentially private (DP) streaming algorithms for $k$-means and $k$-median clustering of $d$-dimensional Euclidean data points over a stream of length at most $T$, using $poly(k,d,\log(T))$ space to achieve a {\it constant} multiplicative error and a $poly(k,d,\log(T))$ additive error. In particular, we present a differentially private streaming clustering framework that requires only an offline DP coreset algorithm as a black box. By plugging in the existing DP coreset results of Ghazi, Kumar, and Manurangsi (2020) and Kaplan and Stemmer (2018), we achieve either (1) a $(1+\gamma)$-multiplicative approximation with $\tilde{O}_\gamma(poly(k,d,\log(T)))$ space for any $\gamma>0$ and $poly(k,d,\log(T))$ additive error, or (2) an $O(1)$-multiplicative approximation with $\tilde{O}(k \cdot poly(d,\log(T)))$ space and $poly(k,d,\log(T))$ additive error. In addition, our framework is differentially private under the continual release setting, i.e., the union of the outputs of our algorithms at all timestamps is differentially private.
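
To make the black-box idea concrete, the following is a minimal sketch of the classical merge-and-reduce pattern for maintaining a small weighted summary of a stream, together with the clustering cost and the multiplicative/additive error condition. It is an illustration only and is not claimed to be the construction used in this work. All names (`clustering_cost`, `satisfies_guarantee`, `placeholder_coreset`, `MergeAndReduce`) are hypothetical, and the placeholder summarizer is plain weighted sampling with no privacy guarantee; a real instantiation would substitute an offline DP coreset algorithm and would also have to privatize the raw buffer.

```python
import math
import random
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, ...]
Weighted = Tuple[Point, float]  # (point, weight)
CoresetFn = Callable[[List[Weighted]], List[Weighted]]


def clustering_cost(points: Sequence[Weighted], centers: Sequence[Point], power: int) -> float:
    """Weighted clustering cost: power=2 is k-means, power=1 is k-median."""
    return sum(w * min(math.dist(p, c) for c in centers) ** power for p, w in points)


def satisfies_guarantee(cost: float, opt: float, alpha: float, beta: float) -> bool:
    """Check cost <= alpha * OPT + beta (multiplicative error alpha, additive error beta)."""
    return cost <= alpha * opt + beta


def placeholder_coreset(points: List[Weighted], m: int = 8, rng=random) -> List[Weighted]:
    """NON-PRIVATE stand-in for the offline coreset black box (assumption).

    Samples m points with probability proportional to weight and gives each
    sample an equal share of the total weight, so total weight is preserved.
    """
    if len(points) <= m:
        return list(points)
    total = sum(w for _, w in points)
    chosen = rng.choices([p for p, _ in points], weights=[w for _, w in points], k=m)
    return [(p, total / m) for p in chosen]


class MergeAndReduce:
    """Keeps at most one summary per level; level i covers 2**i blocks."""

    def __init__(self, coreset_fn: CoresetFn, block_size: int) -> None:
        self.coreset_fn = coreset_fn
        self.block_size = block_size
        self.buffer: List[Weighted] = []
        self.levels: List[Optional[List[Weighted]]] = []

    def insert(self, point: Point) -> None:
        self.buffer.append((point, 1.0))
        if len(self.buffer) < self.block_size:
            return
        summary = self.coreset_fn(self.buffer)
        self.buffer = []
        i = 0
        # Carry upward like binary addition: merge equal-level summaries, then reduce.
        while i < len(self.levels) and self.levels[i] is not None:
            summary = self.coreset_fn(self.levels[i] + summary)
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(None)
        self.levels[i] = summary

    def summary(self) -> List[Weighted]:
        """Union of all level summaries plus the raw buffer (raw buffer is not private)."""
        out = list(self.buffer)
        for level in self.levels:
            if level is not None:
                out.extend(level)
        return out
```

With a block size of $B$ and summaries of size $m$, this pattern stores at most $B$ buffered points plus one size-$m$ summary for each of roughly $\log_2(T/B)$ levels, which is the kind of polylogarithmic dependence on $T$ that the space bounds above describe. Any clustering routine can then be run on `summary()` and scored with `clustering_cost`.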
