With the recent spike in the number and availability of Large Language Models (LLMs), it has become increasingly important to provide large and realistic benchmarks for evaluating Knowledge Graph Question Answering (KGQA) systems. So far, the majority of benchmarks rely on pattern-based SPARQL query generation. The corresponding natural language (NL) questions are then produced through crowdsourcing or automated methods such as rule-based paraphrasing or NL question templates. Although some of these datasets are of considerable size, their weakness lies in the pattern-based generation itself, which does not always generalize well to the vague and linguistically diverse questions asked by humans in real-world contexts. In this paper, we introduce Spider4SPARQL, a new SPARQL benchmark dataset featuring 9,693 pre-existing, manually written NL questions and 4,721 unique, novel SPARQL queries of varying complexity. In addition to the NL/SPARQL pairs, we also provide their corresponding 166 knowledge graphs and ontologies, which cover 138 different domains. Our complex benchmark enables novel ways of evaluating the strengths and weaknesses of modern KGQA systems. We evaluate the benchmark with state-of-the-art KGQA systems as well as LLMs, which achieve only up to 45% execution accuracy, demonstrating that Spider4SPARQL is a challenging benchmark for future research.
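To make the reported metric concrete, the following is a minimal sketch of how execution accuracy is commonly computed for query-generation benchmarks: a predicted query counts as correct when its result set matches that of the gold query. The abstract does not spell out the exact definition used for Spider4SPARQL, so the order-insensitive multiset comparison, the tuple representation of rows, and the function names `same_results` and `execution_accuracy` are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Optional, Sequence, Tuple

# Hypothetical representation: one result row is a tuple of bound values.
Row = Tuple
ResultSet = Sequence[Row]


def same_results(predicted: ResultSet, gold: ResultSet) -> bool:
    """Compare two result sets as multisets of rows, ignoring row order.

    Assumption: order does not matter. A stricter variant would compare
    the sequences directly for queries that contain ORDER BY.
    """
    return Counter(predicted) == Counter(gold)


def execution_accuracy(
    pairs: Iterable[Tuple[Optional[ResultSet], ResultSet]]
) -> float:
    """Fraction of (predicted, gold) result pairs that match.

    A predicted result of None stands for a query that failed to parse
    or execute, and is counted as incorrect.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(
        1
        for predicted, gold in pairs
        if predicted is not None and same_results(predicted, gold)
    )
    return hits / len(pairs)


# Toy data, invented for illustration only (not taken from the benchmark).
example_pairs = [
    ([("Alice",), ("Bob",)], [("Bob",), ("Alice",)]),  # same rows, other order
    ([("Paris",)], [("Berlin",)]),                     # wrong answer
    (None, [("x",)]),                                  # query failed to run
    ([(3,)], [(3,)]),                                  # exact match
]
print(execution_accuracy(example_pairs))
```

In practice the result sets would come from running the predicted and gold SPARQL queries against the same knowledge graph; that step is omitted here because it depends on the triple store in use.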