Hierarchical clustering is a popular unsupervised machine learning method with decades of history and numerous applications. We initiate the study of differentially private approximation algorithms for hierarchical clustering under the rigorous framework introduced by Dasgupta (2016). We show a strong lower bound for the problem: any $\epsilon$-DP algorithm must incur $\Omega(|V|^2/\epsilon)$ additive error on an input dataset $V$. We then give a polynomial-time approximation algorithm with $O(|V|^{2.5}/\epsilon)$ additive error, and an exponential-time algorithm that meets the lower bound. To overcome the lower bound, we focus on the stochastic block model, a popular model of graphs, and, under a separation assumption on the blocks, propose a private $(1+o(1))$-approximation algorithm that also recovers the blocks exactly. Finally, we perform an empirical study of our algorithms and validate their performance.
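For orientation, the objective in Dasgupta's framework charges each weighted pair of points by the number of leaves under their lowest common ancestor in the tree, and lower cost is better; the additive error bounds above are measured against this cost. The following is a minimal, non-private sketch of that cost only. The nested-tuple tree encoding, the `weights` dictionary, and the function names are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def leaves(tree):
    """Return the list of leaves under a nested-tuple tree (hypothetical encoding)."""
    if not isinstance(tree, tuple):
        return [tree]
    out = []
    for child in tree:
        out.extend(leaves(child))
    return out


def dasgupta_cost(tree, weights):
    """Sum over weighted pairs of w(i, j) * (number of leaves under lca(i, j))."""
    if not isinstance(tree, tuple):
        return 0.0  # a single leaf contains no pairs
    child_leaves = [leaves(c) for c in tree]
    n_here = sum(len(ls) for ls in child_leaves)
    # Pairs that fall inside one child are charged deeper in the tree.
    cost = sum(dasgupta_cost(c, weights) for c in tree)
    # Pairs split between two different children have this node as their LCA.
    for a, b in combinations(child_leaves, 2):
        for i in a:
            for j in b:
                w = weights.get((i, j), weights.get((j, i), 0.0))
                cost += w * n_here
    return cost


# Toy graph (assumed for illustration): two tight pairs joined by a weak edge.
weights = {(0, 1): 1.0, (2, 3): 1.0, (1, 2): 0.1}
good_tree = ((0, 1), (2, 3))  # merges the heavy pairs first
bad_tree = ((0, 2), (1, 3))   # separates the heavy pairs until the root

print(dasgupta_cost(good_tree, weights))
print(dasgupta_cost(bad_tree, weights))
```

On this toy input the first tree costs about 4.4 and the second about 8.4, since heavy edges cut at the root are charged for all four leaves. A private algorithm in the sense of the abstract would output a tree whose cost is close to the optimum while satisfying $\epsilon$-DP; this sketch does not attempt that.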