Generating realistic synthetic citation, patent, or component-dependency networks is essential for benchmarking community detection, graph visualisation, and network data mining algorithms. We present the first systematic comparison of generators of directed graphs that are nearly acyclic and have a ground-truth community structure. We evaluate 12 methods across 7 real citation networks and 26 metrics. We propose reversing the directions of edges in static generators to break cycles and induce a citation-like flow, which significantly improves the performance of a degree-corrected Stochastic Block Model. Our novel approach to evaluating community detection benchmarks distinguishes between endogenous and exogenous mesoscopic similarities, with the latter proving more important. This distinction reveals that high-parameter models overfit: they memorise planted community statistics and consequently fail to produce realistic networks. Finally, we introduce the Citation Seeder (CS) algorithm, an iterative generator grounded in the Price-Pareto model of citation networks, with interpretable parameters and O(N+E) runtime. CS is competitive with the best-performing baselines while using up to four orders of magnitude fewer parameters, and it provides a clean framework for explaining and predicting a network's future growth.
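As an illustration of the edge-reversal practice, the following is a minimal sketch of one possible reading, not the procedure used in the paper. It assumes that every node carries a distinct rank (for example, a publication order) and that an edge should always point from the newer node to the older one. The names `orient_as_citations`, `is_acyclic`, and `rank` are hypothetical and introduced only for this example.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def orient_as_citations(edges: Iterable[Edge], rank: Dict[Hashable, int]) -> List[Edge]:
    """Reverse every edge that points from an older node to a newer one.

    Assumption (not from the paper): `rank` assigns a distinct integer to
    each node, and a larger rank means a newer node.
    """
    oriented: List[Edge] = []
    seen = set()
    for u, v in edges:
        if u == v:
            continue  # a self-loop is a cycle of length one, so drop it
        if rank[u] < rank[v]:
            u, v = v, u  # flip so that the newer node cites the older one
        if (u, v) not in seen:  # a flip can duplicate an existing edge
            seen.add((u, v))
            oriented.append((u, v))
    return oriented


def is_acyclic(edges: Iterable[Edge]) -> bool:
    """Check acyclicity with Kahn's topological-sort algorithm."""
    edges = list(edges)
    nodes = {n for e in edges for n in e}
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        n = queue.popleft()
        visited += 1
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    return visited == len(nodes)  # every node removed means no cycle


# Toy input: a directed 3-cycle, a self-loop, and a redundant edge.
raw_edges = [(0, 1), (1, 2), (2, 0), (1, 1), (2, 1)]
rank = {0: 0, 1: 1, 2: 2}
dag_edges = orient_as_citations(raw_edges, rank)
print(dag_edges)              # [(1, 0), (2, 1), (2, 0)]
print(is_acyclic(raw_edges))  # False
print(is_acyclic(dag_edges))  # True
```

With distinct ranks, every retained edge goes from a higher rank to a lower one, so the output cannot contain a directed cycle. Both functions run in time linear in the number of nodes and edges. Dropping duplicates after a flip is a design choice of this sketch; a generator that must preserve edge counts would need a different policy.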