The rapid growth of graph data creates significant scalability challenges, as most graph algorithms scale quadratically with graph size. To mitigate this, Graph Condensation (GC) methods learn a small graph from a larger one to accelerate downstream tasks. However, existing approaches assume a static training set, which conflicts with the inherently dynamic and evolving nature of real-world graph data. This limitation leads to inefficiencies when condensing growing training sets. To address it, this work introduces a framework for continual graph condensation that updates the condensed graph efficiently as data streams in, without costly retraining. Specifically, we propose GECC (\underline{G}raph \underline{E}volving \underline{C}lustering \underline{C}ondensation), a scalable graph condensation method designed for large-scale and evolving graph data. GECC takes a traceable and efficient approach: it performs class-wise clustering on aggregated features. When the condensed graph expands, GECC can inherit the previous condensation results as clustering centroids, which gives it an evolving capability. The method has a theoretical foundation, and comprehensive experiments, including a real-world scenario, show that GECC outperforms most state-of-the-art graph condensation methods while delivering a roughly 1000$\times$ speedup on large datasets.
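
The two mechanisms named in the abstract, class-wise clustering on aggregated features and inheriting earlier results as centroids, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the aggregation operator or the clustering algorithm, so the sketch assumes symmetric-normalized multi-hop propagation and plain Lloyd's k-means, and the names `aggregate_features`, `kmeans`, `condense`, `hops`, and `per_class` are hypothetical.

```python
import numpy as np

def aggregate_features(adj, x, hops=2):
    """Propagate features `hops` times with a normalized adjacency.

    Assumption: symmetric normalization with self-loops; the abstract
    only says "aggregated features".
    """
    a = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    a_norm = a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    h = x.astype(float)
    for _ in range(hops):
        h = a_norm @ h
    return h

def kmeans(points, k, init=None, iters=20, seed=0):
    """Plain Lloyd's k-means; rows of `init` seed the first centroids."""
    rng = np.random.default_rng(seed)
    k = min(k, len(points))
    n_init = 0 if init is None else min(len(init), k)
    # Fill the remaining centroid slots with randomly chosen points.
    extra = points[rng.choice(len(points), size=k - n_init, replace=False)]
    centroids = extra if n_init == 0 else np.vstack([init[:n_init], extra])
    centroids = centroids.astype(float).copy()
    for _ in range(iters):
        dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def condense(h, labels, per_class, previous=None):
    """Cluster each class separately; warm-start from `previous` if given."""
    condensed = {}
    for c in np.unique(labels):
        init = None if previous is None else previous.get(int(c))
        condensed[int(c)] = kmeans(h[labels == c], per_class, init=init)
    return condensed
```

In this sketch, calling `condense(h, labels, 1)` and later `condense(h_new, labels_new, 2, previous=first)` reuses the earlier centroids as starting points instead of clustering from scratch, which is the sense of "evolving" the sketch tries to capture.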