This paper studies Large Language Models (LLMs) augmented with structured data, particularly graphs, which are a crucial data modality that remains underexplored in the LLM literature. We aim to understand when and why incorporating the structural information inherent in graph data can improve the prediction performance of LLMs on node classification tasks with textual features. To address the ``when'' question, we examine a variety of prompting methods for encoding structural information, in settings where textual node features are either rich or scarce. For the ``why'' question, we probe two potential contributing factors to LLM performance: data leakage and homophily. Our exploration of these questions reveals that (i) LLMs can benefit from structural information, especially when textual node features are scarce; (ii) there is no substantial evidence that the performance of LLMs is significantly attributable to data leakage; and (iii) the performance of LLMs on a target node is strongly positively related to the local homophily ratio of the node.\footnote{Code and datasets are available at: \url{https://github.com/TRAIS-Lab/LLM-Structured-Data}}
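To make finding (iii) concrete, the following is a minimal sketch of one common way to compute a local homophily ratio: the fraction of a node's neighbors that share its class label. The abstract does not spell out the definition, so this formula, the function name `local_homophily_ratio`, and the toy graph are all illustrative assumptions rather than the authors' implementation.

```python
from collections import defaultdict


def local_homophily_ratio(edges, labels):
    """Return {node: fraction of its neighbors that share its label}.

    edges  -- iterable of undirected (u, v) pairs (hypothetical toy format)
    labels -- dict mapping each node to its class label
    Nodes with no neighbors are omitted, since the ratio is undefined there.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors[u].add(v)
        neighbors[v].add(u)

    ratios = {}
    for node, nbrs in neighbors.items():
        same = sum(1 for n in nbrs if labels[n] == labels[node])
        ratios[node] = same / len(nbrs)
    return ratios


# Toy graph (assumed for illustration): node 0 has two same-label
# neighbors out of three; node 2 has none.
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
labels = {0: "A", 1: "A", 2: "B", 3: "A"}
ratios = local_homophily_ratio(edges, labels)
```

Under this definition, a node whose neighbors mostly carry its own label has a ratio near 1, which is the regime where neighbor information in a prompt would be expected to help most.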