GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation

Graph-structured data underpins applications from citation analysis and social-network modeling to molecular design and knowledge-graph construction, and Large Language Models (LLMs) are increasingly used as prompt-driven graph synthesizers. Classical graph-generation reviews catalog deep generative models and their evaluation primitives, but predate the LLM era and provide no foundation for evaluating instruction-following graph synthesis. Recent LLM-era benchmarks evaluate models along graph-type or task-domain axes; such organizations, however, average over structural complexity and cannot localize where in the complexity spectrum an LLM breaks down. To close this diagnostic gap, we introduce GraphInstruct, a progressive-complexity benchmark that stratifies LLM graph generation into six complexity levels and five evaluation dimensions, paired with 800 hand-authored instructions, 1,582 algorithmically synthesized reference solutions, and a capability evaluation of 12 LLMs across 45 (model, strategy) configurations. We find that discriminative power peaks at multi-constraint composition rather than reasoning depth, that no single prompting strategy dominates across levels or model families, and that domain-semantic constraints remain iteration-invariant under all tested methods, pointing to retrieval rather than additional compute as the next research frontier. Atop the benchmark, a verification-guided iterative framework with constraint-aware adaptive prompting consistently surpasses the prompt-engineering ceiling on tested target models, demonstrating that the benchmark's fine-grained signals drive method development. Data, code, and reproducibility artifacts are released alongside the paper at https://github.com/AI4DataSynth/GraphInstruct_formal

