Large graph datasets make training graph neural networks (GNNs) computationally costly. Graph condensation methods address this by generating small synthetic graphs that approximate the original data. However, existing approaches rely on clean, supervised labels, which limits their reliability when labels are scarce, noisy, or inconsistent. We propose Pseudo-Labeled Graph Condensation (PLGC), a self-supervised framework that constructs latent pseudo-labels from node embeddings and optimizes condensed graphs to match the original graph's structural and feature statistics, without requiring ground-truth labels. Our work makes three key contributions: (1) a diagnosis of why supervised condensation fails under label noise and distribution shift; (2) a label-free condensation method that jointly learns latent prototypes and node assignments; and (3) theoretical guarantees showing that pseudo-labels preserve latent structural statistics of the original graph and ensure accurate embedding alignment. Empirically, across node classification and link prediction tasks, PLGC achieves performance competitive with state-of-the-art supervised condensation methods on clean datasets and remains substantially more robust under label noise, often outperforming all baselines by a significant margin. These findings highlight the practical and theoretical advantages of self-supervised graph condensation in noisy or weakly labeled environments.
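To make the pseudo-labeling step concrete, the sketch below illustrates one plausible realization of jointly learning latent prototypes and node assignments: a k-means-style alternation over precomputed node embeddings. The function name, its signature, and the use of PyTorch are assumptions for illustration only, not the paper's actual implementation.

```python
# Hypothetical sketch: label-free pseudo-labeling via latent prototypes.
# All names here (pseudo_label, num_prototypes) are illustrative.
import torch

def pseudo_label(embeddings: torch.Tensor, num_prototypes: int,
                 num_iters: int = 20) -> tuple[torch.Tensor, torch.Tensor]:
    """Jointly estimate latent prototypes and node assignments.

    embeddings: (N, d) node embeddings from any self-supervised GNN encoder.
    Returns prototypes of shape (K, d) and hard assignments of shape (N,).
    """
    n = embeddings.size(0)
    # Initialize prototypes from randomly chosen node embeddings.
    idx = torch.randperm(n)[:num_prototypes]
    prototypes = embeddings[idx].clone()
    for _ in range(num_iters):
        # Assignment step: attach each node to its nearest prototype.
        dists = torch.cdist(embeddings, prototypes)  # (N, K)
        assign = dists.argmin(dim=1)                 # (N,)
        # Update step: move each prototype to the mean of its assigned nodes.
        for k in range(num_prototypes):
            mask = assign == k
            if mask.any():
                prototypes[k] = embeddings[mask].mean(dim=0)
    return prototypes, assign
```

In the full framework, assignments of this kind would stand in for ground-truth labels when optimizing the condensed graph to match the original graph's structural and feature statistics.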