Temporal graph clustering (TGC) is a crucial task in temporal graph learning. It focuses on clustering the nodes of temporal graphs, and the mechanism of temporal graph methods gives it greater flexibility on large-scale graph structures. However, the development of TGC is currently constrained by a significant problem: the lack of suitable and reliable large-scale temporal graph datasets for evaluating clustering performance. Most existing temporal graph datasets are small, and even the large-scale ones provide only a limited number of usable node labels. This makes it difficult to evaluate models for large-scale temporal graph clustering. To address this challenge, we build arXiv4TGC, a set of novel academic datasets (arXivAI, arXivCS, arXivMath, arXivPhy, and arXivLarge) for large-scale temporal graph clustering. The largest dataset, arXivLarge, contains 1.3 million labeled nodes and 10 million temporal edges. We further compare the clustering performance of typical temporal graph learning models on both classic temporal graph datasets and the new datasets proposed in this paper. Differences in clustering performance between models are more apparent on arXiv4TGC, which gives higher confidence in the clustering results and makes the datasets better suited to evaluating large-scale temporal graph clustering. The arXiv4TGC datasets are publicly available at: https://github.com/MGitHubL/arXiv4TGC.