Large-scale graphs are ubiquitous in node-classification tasks, and their size significantly hinders real-world applications of Graph Neural Networks (GNNs). Node sampling, graph coarsening, and dataset condensation are effective strategies for improving data efficiency. However, coreset selection, which selects a subset of the data examples, has not been successfully applied to speed up GNN training on large graphs: because graph nodes are interdependent, it requires special treatment. This paper studies graph coresets for GNNs and avoids the interdependence issue by selecting ego-graphs (i.e., neighborhood subgraphs around a node) based on their spectral embeddings. We decompose the coreset selection problem for GNNs into two phases: a coarse selection of widely spread ego-graphs and a refined selection that diversifies their topologies. We design a greedy algorithm that approximately optimizes both objectives. Our spectral greedy graph coreset (SGGC) scales to graphs with millions of nodes, obviates the need for model pre-training, and applies to low-homophily graphs. Extensive experiments on ten datasets demonstrate that SGGC outperforms other coreset methods by a wide margin, generalizes well across GNN architectures, and is much faster than graph condensation.
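To make the ingredients named above concrete, the following is a minimal sketch and not the SGGC algorithm itself. It assumes a small dense, symmetric adjacency matrix, uses eigenvectors of the normalized Laplacian as node-level spectral embeddings, and substitutes a generic farthest-point greedy rule for the coarse "widely spread" selection. The refined topology-diversifying phase is not shown. All function names (`spectral_embedding`, `ego_graph_nodes`, `greedy_spread_selection`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def spectral_embedding(adj, dim):
    """Embed nodes with the `dim` lowest nontrivial eigenvectors of the
    symmetric normalized Laplacian (assumption: dense, undirected graph)."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)  # eigenvalues are returned in ascending order
    return vecs[:, 1:dim + 1]      # skip the trivial first eigenvector


def ego_graph_nodes(adj, center, hops):
    """Return the sorted node ids within `hops` steps of `center`."""
    seen = {center}
    frontier = {center}
    for _ in range(hops):
        reached = set()
        for u in frontier:
            reached.update(np.flatnonzero(adj[u]).tolist())
        frontier = reached - seen
        seen |= frontier
    return sorted(seen)


def greedy_spread_selection(emb, k, start=0):
    """Farthest-point greedy: repeatedly pick the node whose embedding is
    farthest from everything selected so far (a stand-in for the coarse phase)."""
    selected = [start]
    dist = np.linalg.norm(emb - emb[start], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(dist))
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(emb - emb[nxt], axis=1))
    return selected


def path_graph(n):
    """Toy input: adjacency matrix of a path on n nodes."""
    adj = np.zeros((n, n))
    idx = np.arange(n - 1)
    adj[idx, idx + 1] = 1.0
    adj[idx + 1, idx] = 1.0
    return adj
```

In this sketch, one would call `spectral_embedding` once, pick centers with `greedy_spread_selection`, and then train only on the ego-graphs returned by `ego_graph_nodes` for those centers. A dense eigendecomposition does not scale to graphs with millions of nodes, so a sparse or approximate solver would be needed in practice.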