GraphInfer-Bench: Benchmarking LLM's Inference Capability on Graphs

Graph analysis underlies many applications whose answers cannot be looked up in a single record or retrieved along a path: laundering rings, drug repurposing, user preferences, and scientific themes are all inferred from a node together with its neighbourhood. We introduce GraphInfer-Bench, a benchmark that tests whether LLMs can perform this kind of graph inference: producing an open-ended answer that no single node supports and no path retrieves. Existing graph-QA protocols cannot test this capability, because algorithm simulation, node classification, single-node description, KG-QA, and GraphRAG all admit answers retrievable from one node or along a path. GraphInfer-Bench defines five tasks along two axes, Description (what a region is) and Comparison (how regions differ), each constructed so that the ground truth lives in no single node. The release contains 42,000 samples across six real-world graphs, generated automatically and screened by a four-layer quality-control protocol. We evaluate four method families on the same tasks: graph-token alignment models, zero-shot frontier closed-source LLMs, Graph2Text supervised fine-tuning (SFT), and plain GNNs as a structural reference. No method family closes the gap. Graph-token alignment partially handles description tasks (relational, theme) but collapses on comparison tasks. Among LLM-based methods, frontier LLMs lead on outlier detection and community partition but lag on masked-node prediction. Graph2Text SFT is the strongest LLM-based method on the description side yet falls behind frontier LLMs on comparison. On every task, plain GNNs match or beat the strongest LLM-based method, with the largest margin on community detection. GraphInfer-Bench thus surfaces graph inference as an open capability gap rather than a property of any one architecture.
