Graph neural networks (GNNs) have achieved great success in node classification tasks. However, existing GNNs are naturally biased towards the majority classes, which have more labelled data, and neglect the minority classes, which have relatively few labelled nodes. Traditional techniques often resort to over-sampling, which may cause overfitting. More recently, some works have proposed synthesizing additional nodes for minority classes from the labelled nodes; however, there is no guarantee that the generated nodes really represent the corresponding minority classes. In fact, improperly synthesized nodes may hurt the generalization of the algorithm. To resolve this problem, in this paper we seek to automatically augment the minority classes from the massive number of unlabelled nodes in the graph. Specifically, we propose \textit{GraphSR}, a novel self-training strategy that augments the minority classes with a significant diversity of unlabelled nodes, based on a Similarity-based selection module and a Reinforcement Learning (RL) selection module. The first module finds a subset of unlabelled nodes that are most similar to the labelled minority nodes, and the second further selects the representative and reliable nodes from this subset via RL. Furthermore, the RL-based module can adaptively determine the sampling scale according to the current training data. This strategy is general and can be easily combined with different GNN models. Our experiments demonstrate that the proposed approach outperforms the state-of-the-art baselines on various class-imbalanced datasets.
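The similarity-based selection step can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or the node representation, so the following assumes cosine similarity between each unlabelled node's embedding and the centroid of the labelled nodes of a minority class; the names `cosine`, `class_centroid`, and `select_candidates`, and the parameter `k`, are hypothetical and not taken from the paper. The RL-based module that filters these candidates is not sketched here.

```python
import math
from typing import Dict, Hashable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def class_centroid(embeddings: List[Vector]) -> List[float]:
    """Element-wise mean of the embeddings of one class."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def select_candidates(
    labelled: Dict[int, List[Vector]],
    unlabelled: Dict[Hashable, Vector],
    k: int,
) -> Dict[int, List[Hashable]]:
    """For each minority class, return the ids of the k unlabelled nodes
    whose embeddings are most similar to that class's centroid.

    Assumption: centroid-based cosine similarity stands in for whatever
    similarity measure the actual method uses.
    """
    candidates: Dict[int, List[Hashable]] = {}
    for cls, embeddings in labelled.items():
        centre = class_centroid(embeddings)
        ranked = sorted(
            unlabelled,
            key=lambda node: cosine(unlabelled[node], centre),
            reverse=True,
        )
        candidates[cls] = ranked[:k]
    return candidates


# Toy example: one minority class (0) with two labelled embeddings.
labelled = {0: [[1.0, 0.0], [0.9, 0.1]]}
unlabelled = {"a": [1.0, 0.05], "b": [0.0, 1.0], "c": [0.8, 0.3]}
result = select_candidates(labelled, unlabelled, k=2)
```

In this toy setting, nodes `a` and `c` point in roughly the same direction as the class centroid, so they form the candidate subset, while `b` is excluded. A second stage, such as the RL-based module described above, would then decide which of these candidates to add to the training set.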