Graph clustering, a fundamental and challenging task in graph mining, aims to partition the nodes of a graph into several disjoint clusters. In recent years, graph contrastive learning (GCL) has emerged as a dominant line of research in graph clustering and has advanced the state of the art. However, GCL-based methods rely heavily on graph augmentations and contrastive schemes, which may introduce challenges such as semantic drift and scalability issues. Another promising line of research adopts the maximization of modularity, a popular and effective measure for community detection, as the guiding principle for clustering tasks. Despite recent progress, the underlying mechanism of modularity maximization is still not well understood. In this work, we investigate why modularity maximization succeeds at graph clustering. Our analysis reveals strong connections between modularity maximization and graph contrastive learning, in which positive and negative examples are naturally defined by modularity. In light of these results, we propose a community-aware graph clustering framework, coined MAGI, which leverages modularity maximization as a contrastive pretext task to effectively uncover the underlying community information in graphs while avoiding the problem of semantic drift. Extensive experiments on multiple graph datasets verify the effectiveness of MAGI in terms of scalability and clustering performance compared to state-of-the-art graph clustering methods. Notably, MAGI scales easily to a large graph with 100M nodes while outperforming strong baselines.
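For readers unfamiliar with the modularity measure referred to above, the following is a minimal sketch of the standard Newman modularity of a hard partition, Q = Σ_c [ L_c / m − (D_c / 2m)² ], where m is the number of edges, L_c the number of edges inside cluster c, and D_c the total degree of cluster c. It assumes an undirected, unweighted graph without self-loops. The function name `modularity` and the toy graph are illustrative assumptions. The abstract does not specify the exact formulation MAGI optimizes, so this should not be read as the authors' implementation.

```python
from collections import defaultdict


def modularity(edges, labels):
    """Newman modularity Q of a hard partition.

    edges  : list of (u, v) pairs, each undirected edge listed once
    labels : dict mapping every node to its cluster id
    Assumes an unweighted graph without self-loops (illustrative sketch).
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")

    degree = defaultdict(int)  # node -> degree
    intra = defaultdict(int)   # cluster -> number of edges inside it (L_c)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if labels[u] == labels[v]:
            intra[labels[u]] += 1

    cluster_degree = defaultdict(int)  # cluster -> total degree (D_c)
    for node, d in degree.items():
        cluster_degree[labels[node]] += d

    # Q = sum over clusters of (observed intra-cluster edge fraction)
    #     minus (fraction expected under a degree-preserving null model).
    return sum(
        intra[c] / m - (cluster_degree[c] / (2 * m)) ** 2
        for c in cluster_degree
    )


# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
by_triangle = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {n: "a" for n in range(6)}

q_triangles = modularity(edges, by_triangle)
q_single = modularity(edges, one_cluster)
print(q_triangles, q_single)
```

On this toy graph the partition that follows the two triangles scores higher than placing every node in a single cluster, which scores exactly zero. Preferring partitions with higher Q is the behaviour that modularity maximization relies on.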