Large language models (LLMs) are increasingly used for text-rich graph machine learning tasks such as node classification in high-impact domains like fraud detection and recommendation systems. Yet, despite a surge of interest, the field lacks a principled understanding of how LLMs interact with graph data. In this work, we conduct a large-scale, controlled evaluation across several key axes of variability to systematically assess the strengths and weaknesses of LLM-based graph reasoning methods in text-based applications. The axes include the LLM-graph interaction mode, comparing prompting, tool use, and code generation; dataset domains, spanning citation, web-link, e-commerce, and social networks; structural regimes, contrasting homophilic and heterophilic graphs; feature characteristics, involving both short- and long-text node attributes; and model configurations with varying LLM sizes and reasoning capabilities. We further analyze input dependencies by methodically truncating features, deleting edges, and removing labels to quantify each method's reliance on different input types. Our findings provide practical and actionable guidance. (1) LLMs as code generators achieve the strongest overall performance on graph data, with especially large gains on long-text or high-degree graphs where prompting quickly exceeds the token budget. (2) All interaction strategies remain effective on heterophilic graphs, challenging the assumption that LLM-based methods collapse under low homophily. (3) Code generation flexibly shifts its reliance among structure, features, and labels to leverage the most informative input type. Together, these findings provide a comprehensive view of the strengths and limitations of current LLM-graph interaction modes and highlight key design principles for future approaches.