In this work, we are interested in automated methods for knowledge graph construction (KGC) from input text. Progress on large language models (LLMs) has prompted a series of recent works applying them to KGC, e.g., via zero/few-shot prompting. Despite successes on small, domain-specific datasets, these models face difficulties scaling up to text common in many real-world applications. A principal issue is that, in prior methods, the KG schema has to be included in the LLM prompt to generate valid triplets; larger and more complex schemas easily exceed the LLMs' context window length. To address this problem, we propose a three-phase framework named Extract-Define-Canonicalize (EDC): open information extraction followed by schema definition and post-hoc canonicalization. EDC is flexible in that it can be applied both when a pre-defined target schema is available and when it is not; in the latter case, it constructs a schema automatically and applies self-canonicalization. To further improve performance, we introduce a trained component that retrieves schema elements relevant to the input text; this improves the LLMs' extraction performance in a retrieval-augmented generation-like manner. We demonstrate on three KGC benchmarks that EDC is able to extract high-quality triplets without any parameter tuning and with significantly larger schemas than prior works.
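The three phases above can be sketched in a minimal, self-contained form. Everything below is a hypothetical illustration, not the authors' implementation: the hard-coded extraction stands in for an LLM prompt, and the token-prefix overlap in `canonicalize` stands in for the embedding-based schema retrieval the abstract describes.

```python
# Hypothetical sketch of an Extract-Define-Canonicalize pipeline.
# Function names and the toy matching logic are illustrative stand-ins.

def extract(text):
    # Phase 1: open information extraction — pull free-form
    # (subject, relation, object) triplets from the text.
    # A real system would prompt an LLM; here one result is hard-coded.
    if "Marie Curie" in text:
        return [("Marie Curie", "was awarded", "Nobel Prize in Physics")]
    return []

def define(triplets):
    # Phase 2: schema definition — attach a natural-language
    # definition to each extracted relation phrase, which a real
    # system would generate with an LLM to aid canonicalization.
    return {rel: f"'{rel}' links a subject entity to an object entity."
            for _, rel, _ in triplets}

def canonicalize(triplets, target_schema):
    # Phase 3: canonicalization — map each open relation to the
    # closest target-schema relation. Token-prefix overlap is a crude
    # stand-in for retrieval over embedded schema definitions.
    def overlap(a, b):
        ta = {w[:4] for w in a.lower().split()}
        tb = {w[:4] for w in b.lower().split()}
        return len(ta & tb)
    return [(s, max(target_schema, key=lambda cand: overlap(rel, cand)), o)
            for s, rel, o in triplets]

schema = ["award received", "place of birth", "employer"]
text = "Marie Curie was awarded the Nobel Prize in Physics in 1903."
open_triplets = extract(text)
definitions = define(open_triplets)
canonical = canonicalize(open_triplets, schema)
# canonical == [("Marie Curie", "award received", "Nobel Prize in Physics")]
```

In the no-target-schema setting the abstract mentions, `target_schema` would instead be built up from the relations the pipeline itself defines, with new extractions canonicalized against that growing set (self-canonicalization).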