The emergence of large language models (LLMs) has revolutionized the way we interact with graphs, leading to a new paradigm called GraphLLM. Despite the rapid development of GraphLLM methods in recent years, the progress and understanding of this field remain unclear due to the lack of a benchmark with consistent experimental protocols. To bridge this gap, we introduce GLBench, the first comprehensive benchmark for evaluating GraphLLM methods in both supervised and zero-shot scenarios. GLBench provides a fair and thorough evaluation of different categories of GraphLLM methods, along with traditional baselines such as graph neural networks. Through extensive experiments on a collection of real-world datasets with consistent data processing and splitting strategies, we uncover several key findings. First, GraphLLM methods outperform traditional baselines in supervised settings, with LLM-as-enhancers showing the most robust performance. However, using LLMs as predictors is less effective and often leads to uncontrollable output issues. We also observe that no clear scaling laws exist for current GraphLLM methods. In addition, both structures and semantics are crucial for effective zero-shot transfer, and our proposed simple baseline can even outperform several models tailored for zero-shot scenarios. The data and code of the benchmark can be found at https://github.com/NineAbyss/GLBench.