Real-world graph datasets often consist of mixtures of populations, where graphs are generated from multiple distinct underlying distributions. However, modern representation learning approaches, such as graph contrastive learning (GCL) and augmentation methods like Mixup, typically overlook this mixture structure. In this work, we propose a unified framework that explicitly models data as a mixture of underlying probabilistic graph generative models represented by graphons. To characterize these graphons, we leverage graph moments (motif densities) to cluster graphs arising from the same model. This allows us to disentangle the mixture components and identify their distinct generative mechanisms. The resulting model-aware partitioning benefits two key graph learning tasks: 1) it enables graphon-mixture-aware mixup (GMAM), a data augmentation technique that interpolates in a semantically valid space guided by the estimated graphons, rather than assuming a single graphon per class; 2) for GCL, it enables model-adaptive and principled augmentations. Moreover, by introducing a new model-aware objective, our proposed method (termed MGCL) improves negative sampling by restricting negatives to graphs from other models. We establish a key theoretical guarantee: a novel, tighter bound showing that graphs sampled from graphons that are close in cut distance have similar motif densities with high probability. Extensive experiments on benchmark datasets demonstrate strong empirical performance. In unsupervised learning, MGCL achieves state-of-the-art results, obtaining the top average rank across eight datasets. In supervised learning, GMAM consistently outperforms existing strategies, achieving new state-of-the-art accuracy on 6 of 7 datasets.
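To make the clustering step concrete, the sketch below groups graphs by simple graph moments. This is a minimal illustration under our own assumptions, not the paper's implementation: it uses only two hand-picked moments (edge and triangle densities) and off-the-shelf k-means, whereas the actual motif set and estimator may differ.

```python
# Minimal sketch: cluster a mixture of graphs by motif densities.
# Assumptions (not from the paper): two moments, k-means, G(n, p) toy data.
import networkx as nx
import numpy as np
from sklearn.cluster import KMeans

def motif_densities(G: nx.Graph) -> np.ndarray:
    """Two simple graph moments: edge density and triangle density."""
    n = G.number_of_nodes()
    edge_density = G.number_of_edges() / (n * (n - 1) / 2)
    n_triangles = sum(nx.triangles(G).values()) / 3  # each triangle counted 3x
    triangle_density = n_triangles / (n * (n - 1) * (n - 2) / 6)
    return np.array([edge_density, triangle_density])

# Toy mixture: graphs drawn from two distinct generative models.
graphs = (
    [nx.gnp_random_graph(80, 0.10, seed=s) for s in range(20)]
    + [nx.gnp_random_graph(80, 0.35, seed=s) for s in range(20)]
)
X = np.stack([motif_densities(G) for G in graphs])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)  # one cluster per model
```

With clearly separated generative models, the two moment coordinates already place the graphs in distinct regions of feature space, so k-means recovers the mixture components.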
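The model-aware negative sampling described for MGCL can likewise be sketched as a masked InfoNCE loss. Everything here is a hypothetical rendering of the abstract's one-sentence description: the function name `model_aware_infonce`, the temperature `tau`, and the toy tensors are ours, and the paper's actual objective may be defined differently.

```python
# Hypothetical sketch of model-aware negative sampling in a contrastive loss.
import torch
import torch.nn.functional as F

def model_aware_infonce(z1, z2, model_ids, tau=0.5):
    """InfoNCE over two views where negatives are restricted to graphs
    assigned to *other* mixture components (models).
    z1, z2: (B, d) embeddings of two views; model_ids: (B,) cluster labels."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau                       # (B, B) scaled cosine similarities
    pos = sim.diag()                              # positives: same graph, two views
    other = model_ids.unsqueeze(0) != model_ids.unsqueeze(1)
    neg = sim.masked_fill(~other, float('-inf'))  # drop same-model pairs as negatives
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)
    return (torch.logsumexp(logits, dim=1) - pos).mean()

# Usage with random stand-ins for encoder outputs; model_ids would come
# from the motif-density clustering step above.
loss = model_aware_infonce(torch.randn(8, 16), torch.randn(8, 16),
                           torch.tensor([0, 0, 0, 0, 1, 1, 1, 1]))
```

Masking with `-inf` removes same-model graphs from the softmax denominator entirely, which is one straightforward way to realize "restricting negatives to graphs from other models".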