Graphs are ubiquitous data structures found in numerous real-world applications, such as drug discovery, recommender systems, and social network analysis. Graph neural networks (GNNs) have become a popular tool for learning node embeddings through message passing on these structures. However, a significant challenge arises when applying GNNs to multiple graphs with different feature spaces, as existing GNN architectures are not designed for cross-graph feature alignment. To address this, recent approaches introduce text-attributed graphs, where each node is associated with a textual description, enabling the use of a shared textual encoder to project nodes from different graphs into a unified feature space. While promising, this method relies heavily on the availability of text-attributed data, which can be difficult to obtain in practice. To bridge this gap, we propose a novel method named Topology-Aware Node description Synthesis (TANS), which leverages large language models (LLMs) to automatically convert existing graphs into text-attributed graphs. The key idea is to integrate topological information with each node's properties, enhancing the LLMs' ability to explain how graph topology influences node semantics. We evaluate TANS on text-rich, text-limited, and text-free graphs, demonstrating that it enables a single GNN to operate across diverse graphs. Notably, on text-free graphs, our method significantly outperforms existing approaches that manually design node features, showcasing the potential of LLMs for preprocessing graph-structured data, even in the absence of textual information. The code and data are available at https://github.com/Zehong-Wang/TANS.
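The pipeline described above can be sketched minimally as follows. The prompt template, the helper names, and the toy bag-of-words encoder are illustrative assumptions for exposition, not the paper's actual implementation: TANS would send the assembled prompt to an LLM to obtain a node description, and use a pretrained shared text encoder rather than a fixed vocabulary.

```python
from collections import Counter

def synthesize_prompt(node, properties, adjacency):
    """Assemble a topology-aware prompt for one node (hypothetical
    template). In the actual method, this prompt would be sent to an
    LLM, which returns a textual description of the node."""
    neighbors = adjacency.get(node, [])
    degree = len(neighbors)
    props = ", ".join(f"{k}={v}" for k, v in properties.items())
    return (
        f"Node {node} has properties: {props}. "
        f"It has degree {degree} and is connected to nodes "
        f"{sorted(neighbors)}. "
        "Explain how this node's position in the graph shapes its semantics."
    )

def shared_text_encoder(text, vocab):
    """Toy stand-in for a shared textual encoder: a fixed-vocabulary
    bag-of-words vector. Because the vocabulary is shared, node
    descriptions from different graphs land in the same feature space."""
    counts = Counter(text.lower().split())
    return [counts.get(w, 0) for w in vocab]

# Two steps of the pipeline on a tiny graph: build the prompt
# (LLM call omitted), then embed the resulting text.
adjacency = {0: [1, 2], 1: [0], 2: [0]}
description = synthesize_prompt(0, {"label": "hub"}, adjacency)
vocab = ["node", "degree", "connected", "properties"]
embedding = shared_text_encoder(description, vocab)
```

The embeddings produced this way can then be fed to a single GNN regardless of which graph each node came from, which is the cross-graph alignment the abstract refers to.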