Privacy-preserving clustering groups data points in an unsupervised manner whilst ensuring that sensitive information remains protected. Previous work on privacy-preserving clustering focused on identifying dense regions of point clouds. In this paper, we take another path and focus on identifying appropriate separators that split a data set. We introduce DPM, a novel differentially private clustering algorithm that searches for accurate data point separators in a differentially private manner. DPM addresses two key challenges in finding accurate separators: identifying separators that are large gaps between clusters rather than small gaps within a cluster and, to spend the privacy budget efficiently, prioritising separators that split the data into large subparts. Using the differentially private Exponential Mechanism, DPM randomly chooses cluster separators with provably high utility: for a data set $D$, if there is a wide low-density separator in the central $60\%$ quantile, DPM finds that separator with probability $1 - \exp(-\sqrt{|D|})$. Our experimental evaluation demonstrates that DPM achieves significant improvements in terms of the clustering metric inertia. Taking the inertia of the non-private KMeans++ as a baseline, for $\varepsilon = 1$ and $\delta=10^{-5}$, DPM reduces the difference to the baseline by up to $50\%$ for a synthetic data set and by up to $62\%$ for a real-world data set, compared to a state-of-the-art clustering algorithm by Chang and Kamath.
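To illustrate the idea of choosing a separator with the Exponential Mechanism, the following is a minimal one-dimensional sketch and not the DPM algorithm itself. Everything beyond the generic mechanism is an assumption made for illustration: the utility function `make_gap_utility` (the negated number of points close to a candidate split), the fixed candidate grid over the central part of a publicly known range $[0, 1]$, and the `width` parameter are all hypothetical. The sketch also uses pure $\varepsilon$-differential privacy and ignores $\delta$.

```python
import math
import random


def exponential_mechanism(candidates, utility, epsilon, sensitivity, rng=None):
    """Sample one candidate with probability proportional to
    exp(epsilon * utility(c) / (2 * sensitivity))."""
    rng = rng or random.Random()
    scores = [epsilon * utility(c) / (2.0 * sensitivity) for c in candidates]
    max_score = max(scores)  # shift by the maximum for numerical stability
    weights = [math.exp(s - max_score) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


def make_gap_utility(points, width):
    """Hypothetical utility (an assumption, not DPM's scoring function):
    the negated count of points within width / 2 of the candidate split.
    Adding or removing one point changes this count by at most 1,
    so the sensitivity is 1."""
    def utility(split):
        return -sum(1 for p in points if abs(p - split) <= width / 2.0)
    return utility


# Toy data: two tight clusters in a publicly known range [0, 1].
data_rng = random.Random(0)
points = [data_rng.uniform(0.15, 0.25) for _ in range(200)]
points += [data_rng.uniform(0.75, 0.85) for _ in range(200)]

# Data-independent candidate grid over the central part of the range
# (an assumption; candidates derived from the data would themselves leak).
candidates = [0.2 + 0.05 * i for i in range(13)]

split = exponential_mechanism(
    candidates,
    make_gap_utility(points, width=0.2),
    epsilon=1.0,
    sensitivity=1.0,
    rng=random.Random(1),
)
```

In this toy setting, candidates inside the empty region between the two clusters have utility 0, while candidates inside a cluster have utility near $-200$, so the mechanism selects a split in the gap with overwhelming probability. The sketch does not model the second challenge named above, namely preferring separators that split the data into large subparts.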