As a well-known community detection algorithm, Leiden has been widely used in scenarios such as large language model generation (e.g., Graph-RAG), anomaly detection, and biological analysis. In these scenarios, the graphs are often large and dynamic, with vertices and edges inserted and deleted frequently, so recomputing the communities with Leiden from scratch after every change is costly. Recently, one work has studied how to maintain Leiden communities in dynamic graphs, but it lacks a detailed theoretical analysis, and its algorithms are inefficient on large graphs. To address these issues, in this paper we first show theoretically, via boundedness analysis (a powerful tool for analyzing incremental algorithms on dynamic graphs), that the existing algorithms are relatively unbounded, and we analyze how the community memberships of vertices change when the graph changes. Based on this theoretical analysis, we develop a novel, efficient maintenance algorithm called Hierarchical Incremental Tree Leiden (HIT-Leiden), which effectively reduces the range of affected vertices by maintaining the connected components and hierarchical community structures. Comprehensive experiments on various datasets demonstrate the superior performance of HIT-Leiden. In particular, it achieves speedups of up to five orders of magnitude over existing methods.