Recent advances in large language models (LLMs) and foundation models with emergent capabilities have been shown to improve performance on many NLP tasks. LLMs and Knowledge Graphs (KGs) can complement each other: LLMs can be used for KG construction or completion, while existing KGs can be used for tasks such as making LLM outputs explainable or fact-checking them in a neuro-symbolic manner. In this paper, we present Text2KGBench, a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences. We provide two datasets: (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences, and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences. We define seven evaluation metrics to measure fact extraction performance, ontology conformance, and hallucinations by LLMs. Furthermore, we provide results for two baseline models, Vicuna-13B and Alpaca-LoRA-13B, using automatic prompt generation from test cases. The baseline results show that there is room for improvement using both Semantic Web and Natural Language Processing techniques.
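To make the three evaluation dimensions above concrete, the following is a minimal Python sketch of how extracted triples might be scored for fact extraction, ontology conformance, and hallucination. It is an illustration under stated assumptions, not the benchmark's implementation: the abstract does not define the seven metrics, so the formulas here are simplified stand-ins, and the toy ontology, the `Triple` type, and the functions `conforms`, `is_grounded`, and `score` are all hypothetical names. Domain/range checking is omitted because it would require entity typing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    object: str


# Hypothetical toy ontology: relation -> (domain concept, range concept).
ONTOLOGY = {
    "director": ("Film", "Person"),
    "release_year": ("Film", "Year"),
}


def conforms(triple, ontology):
    """Assumed conformance check: the relation must be declared in the ontology."""
    return triple.relation in ontology


def is_grounded(triple, sentence):
    """Assumed faithfulness check: subject and object both appear in the sentence."""
    text = sentence.lower()
    return triple.subject.lower() in text and triple.object.lower() in text


def score(predicted, gold, sentence, ontology):
    """Compute simplified extraction, conformance, and hallucination scores."""
    pred, ref = set(predicted), set(gold)
    correct = pred & ref
    precision = len(correct) / len(pred) if pred else 0.0
    recall = len(correct) / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    conformance = (
        sum(conforms(t, ontology) for t in pred) / len(pred) if pred else 0.0
    )
    hallucination = (
        sum(not is_grounded(t, sentence) for t in pred) / len(pred) if pred else 0.0
    )
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "ontology_conformance": conformance,
        "hallucination_rate": hallucination,
    }


sentence = "Inception was directed by Christopher Nolan and released in 2010."
gold = [
    Triple("Inception", "director", "Christopher Nolan"),
    Triple("Inception", "release_year", "2010"),
]
predicted = [
    Triple("Inception", "director", "Christopher Nolan"),
    # Relation is not in the ontology and the object is not in the sentence.
    Triple("Inception", "producer", "Emma Thomas"),
]

result = score(predicted, gold, sentence, ONTOLOGY)
print(result)
```

In this toy case, one of the two predicted triples matches the gold set, and the other both uses an undeclared relation and mentions an entity absent from the sentence, so it counts against conformance and faithfulness at the same time. A real evaluation would likely need entity normalization rather than exact string matching.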