Graph Neural Networks (GNNs) have demonstrated strong performance in graph property classification. However, selection bias between training and testing data (e.g., training on small graphs and testing on large graphs, or training on dense graphs and testing on sparse graphs) makes distribution shift widespread. More importantly, we often observe a \emph{hybrid structure distribution shift} in both scale and density, even when the data partition is biased along only one of these dimensions. The spurious correlations induced by this hybrid distribution shift degrade the performance of previous GNN methods and make their results unstable across datasets. To alleviate this problem, we propose \texttt{OOD-GMixup}, which jointly manipulates the training distribution through \emph{controllable data augmentation} in metric space. Specifically, we first extract graph rationales to eliminate the spurious correlations caused by irrelevant information. Second, we generate virtual samples by perturbing the graph rationale representations to obtain potential out-of-distribution (OOD) training samples. Finally, we propose an OOD calibration that measures the distribution deviation of the virtual samples using Extreme Value Theory, and we actively control the training distribution by emphasizing the impact of virtual OOD samples. Extensive experiments on several real-world graph classification datasets demonstrate the superiority of the proposed method over state-of-the-art baselines.
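The last two steps can be illustrated with a minimal sketch. It is not the \texttt{OOD-GMixup} implementation: the abstract does not specify the perturbation, the tail model, or the weighting rule, so everything below is an assumption made for illustration. Rationale representations are taken to be plain vectors, virtual samples are formed by standard mixup interpolation, the Extreme Value Theory step is reduced to a peaks-over-threshold fit with an exponential tail (the generalized Pareto distribution with shape zero), and all function names and the parameters `tail_fraction` and `alpha` are hypothetical.

```python
import math
import random


def mixup(h_a, h_b, lam):
    """Convex combination of two rationale representations (plain lists)."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(h_a, h_b)]


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_exponential_tail(distances, tail_fraction=0.2):
    """Peaks-over-threshold fit, assuming an exponential tail (GPD, shape 0)."""
    if len(distances) < 2:
        raise ValueError("need at least two distances to fit a tail")
    ordered = sorted(distances)
    k = max(1, int(len(ordered) * tail_fraction))
    k = min(k, len(ordered) - 1)  # keep at least one point below the tail
    threshold = ordered[-k - 1]
    excesses = [d - threshold for d in ordered[-k:]]
    scale = max(sum(excesses) / k, 1e-12)  # MLE of the exponential scale
    return threshold, scale


def ood_score(distance, threshold, scale):
    """Fitted tail CDF: 0 at or below the threshold, approaching 1 far beyond it."""
    if distance <= threshold:
        return 0.0
    return 1.0 - math.exp(-(distance - threshold) / scale)


def sample_weight(score, alpha=1.0):
    """Hypothetical loss weight that emphasizes samples with a high OOD score."""
    return 1.0 + alpha * score


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for rationale representations produced by a trained encoder.
    train = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
    center = centroid(train)
    threshold, scale = fit_exponential_tail(
        [euclidean(h, center) for h in train]
    )
    lam = rng.betavariate(1.0, 1.0)  # standard mixup draws lam from a Beta
    virtual = mixup(train[0], train[1], lam)
    score = ood_score(euclidean(virtual, center), threshold, scale)
    print(score, sample_weight(score))
```

In this sketch a virtual sample that stays within the bulk of the training representations receives a score of zero and therefore the default weight, while one that falls into the fitted tail is weighted up in proportion to how extreme it is. A different tail model or weighting rule would change these numbers but not the overall structure.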