Graph contrastive learning (GCL) has recently made substantial progress. Existing GCL approaches learn node or graph representations by contrasting two different ``views'' of the same graph. These approaches implicitly assume that the augmentation strategy produces views that are structurally different from the original graph but semantically similar to it, so that the ground-truth labels of the original and augmented graphs or nodes can be regarded as identical during contrastive learning. However, we observe that this assumption does not always hold. For instance, deleting a super-node in a social network can substantially alter the community partitioning of the other nodes. Similarly, any perturbation to the nodes or edges of a molecular graph will change the label of the graph. We therefore argue that augmenting the graph while correspondingly adapting the labels used in the contrastive loss helps the encoder learn better representations. Based on this idea, we propose ID-MixGCL, which simultaneously interpolates input nodes and their corresponding identity labels to obtain soft-confidence samples with a controllable degree of change, enabling fine-grained representations to be learned through self-supervised training on unlabeled graphs. Experimental results show that ID-MixGCL improves performance on graph classification and node classification tasks, with gains of 3-29 absolute percentage points over state-of-the-art techniques on the Cora, IMDB-B, IMDB-M, and PROTEINS datasets.
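To make the idea of jointly interpolating node inputs and identity labels concrete, the following is a minimal NumPy sketch, not the ID-MixGCL implementation. It assumes a mixup-style convex combination with a single coefficient `lam`, one-hot identity labels (node i carries the label e_i), and a random permutation to choose mixing partners. The function name `mix_nodes_and_ids` and all of its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def mix_nodes_and_ids(x, lam, rng):
    """Mixup-style interpolation of node features and one-hot identity labels.

    Illustrative assumption only; the abstract does not specify these details.

    x   : (n, d) array of node features
    lam : mixing coefficient in [0, 1]; lam = 1 leaves everything unchanged
    rng : numpy.random.Generator used to pick mixing partners
    """
    n = x.shape[0]
    ids = np.eye(n)              # identity label of node i is the one-hot vector e_i
    perm = rng.permutation(n)    # node i is mixed with node perm[i]
    # The same coefficient is applied to the inputs and to the labels,
    # so the soft label records how much of each source node is present.
    x_mix = lam * x + (1.0 - lam) * x[perm]
    id_mix = lam * ids + (1.0 - lam) * ids[perm]
    return x_mix, id_mix, perm


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
x_mix, id_mix, perm = mix_nodes_and_ids(x, lam=0.8, rng=rng)
```

Under these assumptions, each row of `id_mix` is a soft identity label that sums to one, and `lam` plays the role of the controllable degree of change. Such soft labels could stand in for the hard positive/negative targets of a contrastive loss, so that a mixed node counts as a partial positive for each of its source nodes.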