Differentially Private Community Detection for Stochastic Block Models

The goal of community detection over graphs is to recover the underlying labels/attributes of users (e.g., political affiliation) given the connectivity between users (represented by the adjacency matrix of a graph). There has been significant recent progress on understanding the fundamental limits of community detection when the graph is generated from a stochastic block model (SBM). Specifically, sharp information-theoretic limits and efficient algorithms have been obtained for SBMs as a function of $p$ and $q$, which represent the intra-community and inter-community connection probabilities, respectively. In this paper, we study the community detection problem while preserving the privacy of the individual connections (edges) between the vertices. Focusing on the notion of $(\epsilon, \delta)$-edge differential privacy (DP), we seek to understand the fundamental tradeoffs between $(p, q)$, the DP budget $(\epsilon, \delta)$, and computational efficiency for exact recovery of the community labels. To this end, we present and analyze the associated information-theoretic tradeoffs for three broad classes of differentially private community recovery mechanisms: a) stability-based mechanisms; b) sampling-based mechanisms; and c) graph perturbation mechanisms. Our main findings are that stability-based and sampling-based mechanisms lead to a superior tradeoff between $(p,q)$ and the privacy budget $(\epsilon, \delta)$; however, this comes at the expense of higher computational complexity. On the other hand, although graph perturbation mechanisms have low complexity, they require the privacy budget $\epsilon$ to scale as $\Omega(\log(n))$ for exact recovery. To the best of our knowledge, this is the first work to study the impact of privacy constraints on the fundamental limits of community detection.
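As a rough illustration of the graph perturbation class, the following Python sketch samples a two-community SBM and releases a noisy copy of the graph. The abstract does not specify the perturbation used; the sketch assumes per-edge randomized response, a standard construction in which every potential edge is flipped independently with probability $1/(1+e^{\epsilon})$, which satisfies $\epsilon$-edge DP with $\delta = 0$. The function names (`sample_sbm`, `randomized_response`, `perturbed_probs`) and the balanced two-community setup are hypothetical choices made for this example.

```python
import math
import random


def sample_sbm(n, p, q, rng):
    """Sample a two-community SBM (assumed balanced for illustration).

    Returns (labels, edges), where edges is a set of pairs (i, j) with i < j.
    """
    labels = [1 if i < n // 2 else -1 for i in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            prob = p if labels[i] == labels[j] else q
            if rng.random() < prob:
                edges.add((i, j))
    return labels, edges


def randomized_response(n, edges, epsilon, rng):
    """Flip each potential edge independently with probability 1/(1+e^eps).

    Assumed mechanism: this is one standard way to obtain eps-edge DP,
    not necessarily the one analyzed in the paper.
    """
    flip = 1.0 / (1.0 + math.exp(epsilon))
    noisy = set()
    for i in range(n):
        for j in range(i + 1, n):
            present = (i, j) in edges
            if rng.random() < flip:
                present = not present
            if present:
                noisy.add((i, j))
    return noisy


def perturbed_probs(p, q, epsilon):
    """Effective intra/inter connection probabilities after the flipping."""
    flip = 1.0 / (1.0 + math.exp(epsilon))
    p_tilde = p * (1.0 - flip) + (1.0 - p) * flip
    q_tilde = q * (1.0 - flip) + (1.0 - q) * flip
    return p_tilde, q_tilde


if __name__ == "__main__":
    rng = random.Random(0)
    n, p, q = 200, 0.30, 0.05
    labels, edges = sample_sbm(n, p, q, rng)
    for eps in (0.5, 2.0, math.log(n)):
        noisy = randomized_response(n, edges, eps, rng)
        print(eps, len(edges), len(noisy), perturbed_probs(p, q, eps))
```

Under this assumed mechanism the released graph is itself an SBM with parameters $(\tilde{p}, \tilde{q})$, so any non-private recovery algorithm can be run on it as post-processing. For small $\epsilon$ the flip probability is close to $1/2$ and $\tilde{p}$ and $\tilde{q}$ become nearly indistinguishable, which gives some intuition for why this class needs a large privacy budget.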
