We study semi-supervised node classification on the Contextual Stochastic Block Model (CSBM), in which the nodes of a two-cluster Stochastic Block Model (SBM) are paired with feature vectors drawn from a Gaussian Mixture Model (GMM) whose components correspond to the node labels. Given the labels of only a subset of the CSBM nodes for training, the goal is to classify the remaining nodes correctly. In this transductive setting, we identify, for the first time, the information-theoretic threshold for exact recovery of all test nodes in CSBM. We also design an optimal spectral estimator, inspired by Principal Component Analysis (PCA), that uses the training labels together with information from both the adjacency matrix and the feature vectors. In addition, we evaluate graph ridge regression and Graph Convolutional Networks (GCN) on this synthetic model. Our results show that, with optimally weighted self-loops, graph ridge regression and GCN can achieve the information-theoretic threshold for exact recovery, as the optimal estimator does. This highlights the potential role of feature learning in improving GCN performance, especially in semi-supervised learning.
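To make the setting concrete, the following is a minimal sketch of sampling a two-cluster CSBM and classifying the unlabeled nodes with a ridge regression on self-loop-augmented aggregated features. Everything specific here is an assumption chosen for illustration and is not taken from the paper: the function names `sample_csbm` and `graph_ridge_predict`, the edge probabilities `p` and `q`, the mean vector `mu`, the self-loop weight `rho`, and the ridge penalty `lam`. In particular, `rho = 1.0` is an arbitrary value, not the optimal self-loop weight referred to above, and the estimator is a generic linear one rather than the paper's exact construction.

```python
import numpy as np

def sample_csbm(n, d, p, q, mu, rng):
    """Sample labels, adjacency matrix, and features from a two-cluster CSBM.

    Illustrative parameterization (an assumption): labels are +/-1, edges
    appear with probability p within a cluster and q across clusters, and
    features are y_i * mu plus standard Gaussian noise.
    """
    y = rng.choice([-1.0, 1.0], size=n)
    same_cluster = y[:, None] == y[None, :]
    probs = np.where(same_cluster, p, q)
    # Sample the strict upper triangle, then symmetrize (no self-loops).
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    A = (upper | upper.T).astype(float)
    X = np.outer(y, mu) + rng.standard_normal((n, d))
    return y, A, X

def graph_ridge_predict(A, X, y, train_mask, rho=1.0, lam=1.0):
    """Ridge regression on aggregated features (A + rho * I) X.

    Only the labels selected by train_mask are used for fitting; predictions
    are returned for all nodes (transductive setting).
    """
    n, d = X.shape
    H = (A + rho * np.eye(n)) @ X          # one round of neighbor aggregation
    H_tr = H[train_mask]
    w = np.linalg.solve(H_tr.T @ H_tr + lam * np.eye(d), H_tr.T @ y[train_mask])
    return np.sign(H @ w)

rng = np.random.default_rng(0)
n, d = 400, 10
mu = np.zeros(d)
mu[0] = 3.0                                 # hypothetical cluster mean

y, A, X = sample_csbm(n, d, p=0.2, q=0.02, mu=mu, rng=rng)

# Reveal the labels of 100 randomly chosen nodes for training.
train_mask = np.zeros(n, dtype=bool)
train_mask[rng.choice(n, size=100, replace=False)] = True

y_hat = graph_ridge_predict(A, X, y, train_mask)
test_acc = np.mean(y_hat[~train_mask] == y[~train_mask])
print(f"test accuracy: {test_acc:.3f}")
```

The parameters above put the example in an easy, dense regime so that the sketch runs quickly and separates the clusters cleanly; studying behavior near the exact-recovery threshold would require the sparser scaling analyzed in the paper.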