As networks continue to grow in size, methods must be capable of handling large numbers of nodes and edges in order to be practically relevant. Instead of working directly with the entire (large) network, analyzing sub-networks has become a popular approach. Due to a network's inherent inter-connectedness, however, sub-sampling is not a trivial task. While this problem has attracted growing interest in recent years, it has not received sufficient attention from the statistics community. In this work, we provide a thorough comparison of seven graph sub-sampling algorithms by applying them to divide-and-conquer algorithms for community structure and core-periphery (CP) structure. After discussing the various algorithms and sub-sampling routines, we derive theoretical results for the mis-classification rate of the divide-and-conquer algorithm for CP structure under various sub-sampling schemes. We then perform extensive experiments on both simulated and real-world data to compare the various methods. For the community detection task, we find that sampling nodes uniformly at random yields the best performance among the sub-sampling routines, but that the base algorithm applied to the entire network sometimes yields better results in terms of both identification and computational time. For CP structure, on the other hand, there is no single winner, but algorithms that sample core nodes at a higher rate consistently outperform other sampling routines, e.g., random edge sampling and random walk sampling. Unlike in community detection, the CP divide-and-conquer algorithm tends to yield better identification results while also being faster than the base algorithm. The varying performance of the sampling algorithms on different tasks demonstrates the importance of carefully selecting a sub-sampling routine for the specific application.
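To make the sampling schemes named above concrete, the following is a minimal Python sketch of three of them: uniform node sampling, random edge sampling, and random walk sampling. It is an illustration under stated assumptions, not the implementation evaluated in this work. The graph is assumed to be undirected and stored as a dictionary mapping each integer node to a set of neighbors, and the function names (`induced_subgraph`, `uniform_node_sample`, `random_edge_sample`, `random_walk_sample`) are hypothetical. Variants of each scheme exist; for instance, edge sampling here keeps only the sampled edges rather than the subgraph induced by their endpoints.

```python
import random


def induced_subgraph(adj, nodes):
    """Return the subgraph of `adj` induced by `nodes`."""
    keep = set(nodes)
    return {u: adj[u] & keep for u in keep}


def uniform_node_sample(adj, k, rng):
    """Sample k nodes uniformly at random and keep the edges among them."""
    nodes = rng.sample(sorted(adj), k)
    return induced_subgraph(adj, nodes)


def random_edge_sample(adj, m, rng):
    """Sample m edges uniformly at random; keep only those edges.

    Assumption: the sub-network consists of the sampled edges and
    their endpoints (not the subgraph induced by the endpoints).
    """
    edges = sorted({(min(u, v), max(u, v)) for u in adj for v in adj[u]})
    sub = {}
    for u, v in rng.sample(edges, m):
        sub.setdefault(u, set()).add(v)
        sub.setdefault(v, set()).add(u)
    return sub


def random_walk_sample(adj, k, rng, max_steps=10_000):
    """Walk until k distinct nodes are visited, then take the induced subgraph.

    Assumption: the walk restarts from a uniformly chosen node at a dead
    end, and may return fewer than k nodes if `max_steps` is exhausted.
    """
    current = rng.choice(sorted(adj))
    visited = {current}
    for _ in range(max_steps):
        if len(visited) >= k:
            break
        neighbors = sorted(adj[current])
        if neighbors:
            current = rng.choice(neighbors)
        else:
            current = rng.choice(sorted(adj))
        visited.add(current)
    return induced_subgraph(adj, visited)


# Toy example: a ring of 12 nodes (hypothetical data, for illustration only).
rng = random.Random(0)
ring = {i: {(i - 1) % 12, (i + 1) % 12} for i in range(12)}
node_sub = uniform_node_sample(ring, 5, rng)
edge_sub = random_edge_sample(ring, 4, rng)
walk_sub = random_walk_sample(ring, 5, rng)
```

In a divide-and-conquer setting, a routine like one of these would be called repeatedly to produce sub-networks, the base algorithm would be run on each, and the per-sub-network labels would then be combined. Note how the schemes differ in what they favor: uniform node sampling ignores degree, whereas edge and random walk sampling reach high-degree nodes more often.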