We study the problem of exact community recovery in the Geometric Stochastic Block Model (GSBM), where each vertex has an unknown community label as well as a known position, generated according to a Poisson point process in $\mathbb{R}^d$. Edges are formed independently conditioned on the community labels and positions, where vertices may only be connected by an edge if they are within a prescribed distance of each other. The GSBM thus favors the formation of dense local subgraphs, which commonly occur in real-world networks, a property that makes the GSBM qualitatively very different from the standard Stochastic Block Model (SBM). We propose a linear-time algorithm for exact community recovery, which succeeds down to the information-theoretic threshold, confirming a conjecture of Abbe, Baccelli, and Sankararaman. The algorithm involves two phases. The first phase exploits the density of local subgraphs to propagate estimated community labels among sufficiently occupied subregions, and produces an almost-exact vertex labeling. The second phase then refines the initial labels using a Poisson testing procedure. Thus, the GSBM enjoys local-to-global amplification just as the SBM does, with the added advantage of admitting an information-theoretically optimal, linear-time algorithm.
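To make the model concrete, the following is a minimal Python sketch of sampling a graph with the structure described above. It is illustrative only, not the paper's construction. Everything the abstract leaves unspecified is an assumption: two equiprobable communities labeled $\pm 1$, a cubic window of a given volume without wrap-around, and edge probabilities `p_in` and `p_out` for same-community and different-community pairs within distance `radius`. The function and parameter names are hypothetical.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw a Poisson(mean) count by counting unit-rate exponential arrivals in [0, mean]."""
    count = 0
    t = rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def sample_gsbm(volume, intensity, radius, p_in, p_out, dim=2, seed=0):
    """Sample positions, labels, and edges of a GSBM-like graph (illustrative sketch).

    Assumptions not taken from the abstract: two equiprobable communities
    labeled -1 and +1, a cubic window [0, side]^dim with no wrap-around,
    and constant edge probabilities p_in / p_out within the radius.
    """
    rng = random.Random(seed)
    side = volume ** (1.0 / dim)

    # Homogeneous Poisson point process: Poisson number of points,
    # each placed uniformly at random in the window.
    n = sample_poisson(intensity * volume, rng)
    positions = [tuple(rng.uniform(0.0, side) for _ in range(dim)) for _ in range(n)]
    labels = [rng.choice((-1, 1)) for _ in range(n)]

    # Naive O(n^2) pair scan, chosen for clarity rather than speed.
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            # Only vertices within the prescribed distance may be connected.
            if math.dist(positions[u], positions[v]) > radius:
                continue
            p = p_in if labels[u] == labels[v] else p_out
            if rng.random() < p:
                edges.add((u, v))
    return positions, labels, edges
```

For example, `sample_gsbm(volume=100.0, intensity=1.0, radius=1.5, p_in=0.8, p_out=0.2)` returns the vertex positions, the hidden labels, and the edge set. A recovery algorithm would see only the positions and edges. The quadratic pair scan keeps the sketch short; a grid of cells with side length `radius` would avoid comparing distant pairs.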