Large language models (LLMs) demonstrate great potential for problems with implicit graphical structures, and recent works seek to enhance the graph reasoning capabilities of LLMs through specialized instruction tuning. However, the resulting 'graph LLMs' are evaluated only in in-distribution settings, so it remains underexplored whether LLMs are learning generalizable graph reasoning skills or merely memorizing patterns in the synthetic training data. To this end, we propose the NLGift benchmark, an evaluation suite for LLM graph reasoning generalization: whether LLMs can go beyond the semantic, numeric, structural, and reasoning patterns in the synthetic training data and improve utility on real-world graph-based tasks. Extensive experiments with two LLMs across four graph reasoning tasks demonstrate that while generalization on simple patterns (semantic, numeric) is somewhat satisfactory, LLMs struggle to generalize across reasoning and real-world patterns, casting doubt on the benefit of synthetic graph tuning for real-world tasks with underlying network structures. We explore three strategies to improve LLM graph reasoning generalization, and we find that while post-training alignment is most promising for real-world tasks, empowering LLM graph reasoning to go beyond pattern memorization remains an open research question.