Graph-structured data underpins applications from citation analysis and social-network modeling to molecular design and knowledge-graph construction, and Large Language Models (LLMs) are increasingly used as prompt-driven graph synthesizers. Classical graph-generation reviews catalog deep generative models and their evaluation primitives, but predate the LLM era and provide no foundation for evaluating instruction-following graph synthesis. Recent LLM-era benchmarks evaluate models along graph-type or task-domain axes; such organizations, however, average over structural complexity and cannot localize where in the complexity spectrum an LLM breaks down. To close this diagnostic gap, we introduce GraphInstruct, a progressive-complexity benchmark that stratifies LLM graph generation into six complexity levels and five evaluation dimensions, paired with 800 hand-authored instructions, 1,582 algorithmically synthesized reference solutions, and a 12-LLM capability evaluation across 45 (model, strategy) configurations. We find that discriminative power peaks at multi-constraint composition rather than reasoning depth, that no single prompting strategy dominates across levels or model families, and that domain-semantic constraints remain iteration-invariant under all tested methods -- pointing to retrieval rather than additional compute as the next research frontier. Atop the benchmark, a verification-guided iterative framework with constraint-aware adaptive prompting consistently surpasses the prompt-engineering ceiling on tested target models, demonstrating that the benchmark's fine-grained signals drive method development. Data, code, and reproducibility artifacts are released alongside the paper at https://github.com/AI4DataSynth/GraphInstruct_formal