Node classification on graphs frequently suffers from class imbalance, which biases performance toward majority classes and poses significant risks in real-world applications. Although several data-centric solutions have been proposed, none of them focuses on Text-Attributed Graphs (TAGs), and they therefore overlook the potential of the rich semantics encoded in textual features for improving the classification of minority nodes. To address this gap, we investigate augmenting graph data in the text space, leveraging the text generation capabilities of Large Language Models (LLMs) to handle imbalanced node classification on TAGs. Specifically, we propose a novel approach called LA-TAG (LLM-based Augmentation on Text-Attributed Graphs), which prompts LLMs to generate synthetic texts based on existing node texts in the graph. Furthermore, to integrate these synthetic text-attributed nodes into the graph, we introduce a text-based link predictor that connects the synthesized nodes to existing nodes. Our experiments across multiple datasets and evaluation metrics show that our framework significantly outperforms both traditional non-text-based data augmentation strategies and methods designed specifically for node imbalance. This highlights the promise of using LLMs to resolve imbalance issues on TAGs.
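The two steps described above (generate synthetic text for minority classes, then attach the new nodes to the graph) can be illustrated with a minimal sketch. This is not the LA-TAG implementation: every name here (`augment_minority_nodes`, `toy_generator`, `bag_of_words`, `cosine`) is hypothetical, the generator is a placeholder where an LLM call would go, and the trained text-based link predictor is replaced by a simple bag-of-words cosine similarity with top-k selection. Balancing every class up to the size of the largest class is also an assumption made for illustration.

```python
import math
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Edge = Tuple[int, int]


def bag_of_words(text: str) -> Counter:
    """Lowercased token counts; a crude stand-in for a learned text embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def toy_generator(label: int, examples: List[str]) -> str:
    """Placeholder for an LLM call (assumption): splices two existing texts."""
    first, second = examples[0].split(), examples[-1].split()
    return " ".join(first[: len(first) // 2 + 1] + second[len(second) // 2 :])


def augment_minority_nodes(
    texts: Sequence[str],
    labels: Sequence[int],
    edges: Sequence[Edge],
    generate: Callable[[int, List[str]], str] = toy_generator,
    k: int = 2,
    seed: int = 0,
) -> Tuple[List[str], List[int], List[Edge]]:
    """Add synthetic nodes until every class matches the largest class size.

    Each synthetic node is linked to the k original nodes whose text is most
    similar to it (a heuristic stand-in for a trained link predictor).
    """
    rng = random.Random(seed)
    # Copy the inputs so the caller's graph is left untouched.
    new_texts, new_labels, new_edges = list(texts), list(labels), list(edges)
    n_original = len(new_texts)
    counts = Counter(new_labels)
    target = max(counts.values())
    vectors = [bag_of_words(t) for t in new_texts]  # original nodes only

    for label, count in sorted(counts.items()):
        members = [i for i in range(n_original) if labels[i] == label]
        for _ in range(target - count):
            # Seed the generator with up to two existing texts from this class.
            picked = rng.sample(members, min(2, len(members)))
            synthetic = generate(label, [new_texts[i] for i in picked])
            new_id = len(new_texts)
            new_texts.append(synthetic)
            new_labels.append(label)
            # Rank original nodes by similarity and connect to the top k.
            vec = bag_of_words(synthetic)
            ranked = sorted(
                range(n_original),
                key=lambda i: cosine(vec, vectors[i]),
                reverse=True,
            )
            new_edges.extend((new_id, i) for i in ranked[:k])
    return new_texts, new_labels, new_edges
```

In practice, `generate` would wrap a prompt to an LLM that receives the class label and a few example node texts, and the similarity ranking would be replaced by scores from a link predictor trained on the original edges. The sketch only fixes the data flow: which nodes are synthesized, and how they enter the edge list.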