Evaluating the graph comprehension and reasoning abilities of Large Language Models (LLMs) is challenging, and existing evaluations are often incomplete. Current benchmarks focus primarily on pure graph understanding; they lack comprehensive coverage of graph types and detailed definitions of the capabilities being tested. This paper presents GraCoRe, a benchmark for systematically assessing LLMs' graph comprehension and reasoning. GraCoRe uses a three-tier hierarchical taxonomy to categorize and test models on pure graphs and heterogeneous graphs, subdividing capabilities into 10 distinct areas tested through 19 tasks. The benchmark includes 11 datasets with 5,140 graphs of varying complexity. We evaluated three closed-source and seven open-source LLMs and analyzed the results from both ability and task perspectives. Key findings are that semantic enrichment enhances reasoning performance, node ordering affects task success, and the ability to process longer texts does not necessarily improve graph comprehension or reasoning. GraCoRe is open-sourced at https://github.com/ZIKEYUAN/GraCoRe.