As the field of Large Language Models (LLMs) evolves at an accelerated pace, the need to assess and monitor their performance becomes critical. We introduce a benchmarking framework focused on knowledge graph engineering (KGE), accompanied by three challenges addressing syntax and error correction, fact extraction, and dataset generation. We show that LLMs, while useful tools, are not yet fit to assist in knowledge graph generation with zero-shot prompting. Consequently, our LLM-KG-Bench framework provides automatic evaluation and storage of LLM responses, as well as statistical data and visualization tools, to support the tracking of prompt engineering and model performance.