Graph neural networks (GNNs) have emerged as the go-to models for node classification on graph data, thanks to their powerful ability to fuse graph structure and attributes. However, such models rely heavily on an adequate amount of high-quality labeled data for training, which is expensive to acquire in practice. With the advent of large language models (LLMs), a promising direction is to leverage their superb zero-shot capabilities and massive knowledge for node labeling. Despite the promising results reported, this methodology either demands a considerable number of queries to LLMs, or suffers from compromised performance caused by the noisy labels LLMs produce. To remedy these issues, this work presents Cella, an active self-training framework that integrates LLMs into GNNs in a cost-effective manner. The design recipe of Cella is to iteratively identify small sets of "critical" samples using GNNs and to extract informative pseudo-labels for them with both LLMs and GNNs, which serve as additional supervision signals to enhance model training. In particular, Cella comprises three major components: (i) an effective active node selection strategy for initial annotation; (ii) a judicious sample selection scheme that sifts out the "critical" nodes based on label disharmonicity and entropy; and (iii) a label refinement module that combines LLMs and GNNs over a rewired topology. Our extensive experiments on five benchmark text-attributed graph datasets demonstrate that Cella significantly outperforms the state of the art in label-free node classification under the same LLM query budget. In particular, on the DBLP dataset with 14.3k nodes, Cella achieves a conspicuous 8.08% improvement in accuracy over the state-of-the-art method at a cost of less than one cent.