Learning the community structure of a large-scale graph is a fundamental problem in machine learning, computer science and statistics. We study the problem of exactly recovering the communities in a graph generated from the Stochastic Block Model (SBM) in the Massively Parallel Computation (MPC) model. Specifically, given $kn$ vertices that are partitioned into $k$ equal-sized clusters (i.e., each has size $n$), a graph on these $kn$ vertices is randomly generated such that each pair of vertices is connected with probability~$p$ if they are in the same cluster and with probability $q$ if not, where $p > q > 0$. We give MPC algorithms for the SBM in the (very general) \emph{$s$-space MPC model}, where each machine has memory $s=\Omega(\log n)$. Under the condition that $\frac{p-q}{\sqrt{p}}\geq \tilde{\Omega}(k^{\frac12}n^{-\frac12+\frac{1}{2(r-1)}})$ for any integer $r\in [3,O(\log n)]$, our first algorithm exactly recovers all the $k$ clusters in $O(kr\log_s n)$ rounds using $\tilde{O}(m)$ total space, or in $O(r\log_s n)$ rounds using $\tilde{O}(km)$ total space. If $\frac{p-q}{\sqrt{p}}\geq \tilde{\Omega}(k^{\frac34}n^{-\frac14})$, our second algorithm achieves $O(\log_s n)$ rounds and $\tilde{O}(m)$ total space complexity. Both algorithms significantly improve upon a recent result of Cohen-Addad et al. [PODC'22], who gave algorithms that only work in the \emph{sublinear space MPC model}, where each machine has local memory~$s=O(n^{\delta})$ for some constant $\delta>0$, with a much stronger condition on $p,q,k$. Our algorithms are based on collecting the $r$-step neighborhood of each vertex and comparing the difference of some statistical information generated from the local neighborhoods for each pair of vertices. To implement the clustering algorithms in parallel, we present efficient approaches for implementing some basic graph operations in the $s$-space MPC model.
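To make the generative model above concrete, the following is a minimal sequential sketch of sampling a graph from the SBM as described: $kn$ vertices, $k$ equal-sized clusters of size $n$, intra-cluster edge probability $p$ and inter-cluster edge probability $q$. The function name `sample_sbm`, the convention that vertex `v` belongs to cluster `v // n`, and the `seed` parameter are illustrative assumptions, not part of the paper. The sketch only generates the input and does not implement the MPC recovery algorithms.

```python
import random
from itertools import combinations


def sample_sbm(k, n, p, q, seed=None):
    """Sample an SBM graph on k*n vertices with k clusters of size n.

    Assumption for illustration: vertex v is placed in cluster v // n.
    Returns (labels, edges), where edges is a list of pairs (u, v) with u < v.
    """
    rng = random.Random(seed)
    labels = [v // n for v in range(k * n)]
    edges = []
    for u, v in combinations(range(k * n), 2):
        # Same cluster: connect with probability p; otherwise with probability q.
        prob = p if labels[u] == labels[v] else q
        if rng.random() < prob:
            edges.append((u, v))
    return labels, edges


# Example: 3 clusters of 20 vertices each, with p > q > 0.
labels, edges = sample_sbm(k=3, n=20, p=0.6, q=0.05, seed=0)
intra = sum(labels[u] == labels[v] for u, v in edges)
print(len(edges), "edges,", intra, "of them inside a cluster")
```

This loop inspects all $\binom{kn}{2}$ vertex pairs, so it is only suitable for small test instances.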