Massive network datasets are becoming increasingly common in scientific applications. Existing community detection methods face significant computational challenges on such networks for two reasons. First, the full network needs to be stored and analyzed on a single server, leading to high memory costs. Second, existing methods typically rely on matrix factorization or iterative optimization over the full network, resulting in long runtimes. We propose a strategy called \textit{predictive assignment} to enable computationally efficient community detection while ensuring statistical accuracy. The core idea is to avoid large-scale matrix computations by breaking the task into one small matrix computation plus a large number of vector computations that can be carried out in parallel. Under the proposed method, community detection is first carried out on a small subgraph to estimate the relevant model parameters. Each remaining node is then assigned to a community based on these estimates. We prove that predictive assignment achieves strong consistency under the stochastic blockmodel and its degree-corrected version. We also demonstrate the empirical performance of predictive assignment on simulated networks and two large real-world datasets: DBLP (Digital Bibliography \& Library Project), a computer science bibliographic database, and the Twitch Gamers Social Network.
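The two-stage idea can be illustrated with a minimal sketch under the plain stochastic blockmodel. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (`simulate_sbm`, `spectral_labels`, `predictive_assignment`, `agreement`), the use of basic spectral clustering with a small k-means on the subgraph, uniform node subsampling, and the nearest-row assignment rule, which compares each remaining node's connection rates into the estimated communities with the estimated block probability matrix. The paper's actual estimators and assignment rule may differ.

```python
import itertools
import numpy as np

def simulate_sbm(n, K, p_in, p_out, rng):
    """Toy symmetric SBM adjacency matrix with random community labels."""
    z = rng.integers(K, size=n)
    P = np.where(z[:, None] == z[None, :], p_in, p_out)
    upper = np.triu(rng.random((n, n)) < P, k=1)
    return (upper | upper.T).astype(float), z

def spectral_labels(A_sub, K, n_iter=50):
    """Stand-in subgraph clustering: top-K eigenvectors, then Lloyd's k-means."""
    vals, vecs = np.linalg.eigh(A_sub)
    X = vecs[:, np.argsort(np.abs(vals))[-K:]]
    # Farthest-point initialisation keeps the sketch deterministic.
    centers = [X[0]]
    for _ in range(1, K):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = np.argmin(dist, axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def predictive_assignment(A, sub_idx, K):
    """Cluster the subgraph, then assign every other node from its edges into it."""
    n = A.shape[0]
    rest_idx = np.setdiff1d(np.arange(n), sub_idx)
    # Stage 1: one small matrix computation on the m x m subgraph.
    A_sub = A[np.ix_(sub_idx, sub_idx)]
    sub_labels = spectral_labels(A_sub, K)
    M = np.eye(K)[sub_labels]                      # m x K membership indicators
    sizes = M.sum(axis=0)
    pairs = np.outer(sizes, sizes) - np.diag(sizes)
    B_hat = (M.T @ A_sub @ M) / np.maximum(pairs, 1)
    # Stage 2: one vector computation per remaining node (rows are independent,
    # so this step could be split across workers).
    rates = (A[np.ix_(rest_idx, sub_idx)] @ M) / np.maximum(sizes, 1)
    dist = ((rates[:, None, :] - B_hat[None]) ** 2).sum(-1)
    labels = np.empty(n, dtype=int)
    labels[sub_idx] = sub_labels
    labels[rest_idx] = np.argmin(dist, axis=1)
    return labels

def agreement(labels, truth, K):
    """Fraction of correctly labelled nodes, maximised over label permutations."""
    return max(np.mean(np.array(perm)[labels] == truth)
               for perm in itertools.permutations(range(K)))

rng = np.random.default_rng(0)
A, z = simulate_sbm(600, 2, 0.3, 0.05, rng)
sub_idx = rng.choice(600, size=150, replace=False)
labels = predictive_assignment(A, sub_idx, 2)
print(f"agreement with true communities: {agreement(labels, z, 2):.3f}")
```

For brevity the sketch holds the full adjacency matrix in memory. In a setting like the one described above, only the subgraph block and one row of subgraph connections per remaining node would be needed at a time.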