ODKE+: Ontology-Guided Open-Domain Knowledge Extraction with LLMs

Samira Khorshidi, Azadeh Nikfarjam, Suprita Shankar, Yisi Sang, Yash Govind, Hyun Jang, Ali Kasgari, Alexis McClimans, Mohamed Soliman, Vishnu Konda, Ahmed Fakhry, Xiaoguang Qi

Knowledge graphs (KGs) are foundational to many AI applications, but keeping them fresh and complete remains costly. We present ODKE+, a production-grade system that automatically extracts and ingests millions of open-domain facts from web sources with high precision. ODKE+ combines modular components into a scalable pipeline: (1) the Extraction Initiator detects missing or stale facts, (2) the Evidence Retriever collects supporting documents, (3) hybrid Knowledge Extractors apply both pattern-based rules and ontology-guided prompting of large language models (LLMs), (4) a lightweight Grounder validates extracted facts using a second LLM, and (5) the Corroborator ranks and normalizes candidate facts for ingestion. ODKE+ dynamically generates ontology snippets tailored to each entity type to align extractions with schema constraints, enabling scalable, type-consistent fact extraction across 195 predicates. The system supports batch and streaming modes; it has processed over 9 million Wikipedia pages and ingested 19 million high-confidence facts with 98.8% precision. ODKE+ significantly improves coverage over traditional methods, achieving up to 48% overlap with third-party KGs and reducing update lag by 50 days on average. Our deployment demonstrates that LLM-based extraction, grounded in ontological structure and verification workflows, can deliver trustworthy, production-scale knowledge ingestion with broad real-world applicability. A recording of the system demonstration is included with the submission and is also available at https://youtu.be/UcnE3_GsTWs.
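The five-stage pipeline described in the abstract can be illustrated with a minimal sketch. Everything below is a toy stand-in and an assumption made for illustration, not the ODKE+ implementation: the `Fact` record, the two-predicate `ONTOLOGY`, the function names, and the single regex rule are all hypothetical. The Grounder is approximated by a substring check rather than a second LLM, the LLM extractor is a caller-supplied stub, and the Extraction Initiator only detects missing facts, not stale ones.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    value: str
    source: str  # id of the document the fact was extracted from

# Hypothetical toy ontology: entity type -> {predicate: expected value type}.
ONTOLOGY = {"Person": {"date_of_birth": "date (YYYY-MM-DD)", "occupation": "string"}}

def ontology_snippet(entity_type):
    """Build a per-type schema snippet to place in the extraction prompt."""
    lines = [f"- {p}: {t}" for p, t in sorted(ONTOLOGY[entity_type].items())]
    return f"Entity type: {entity_type}\n" + "\n".join(lines)

def initiate(kg, entity, entity_type):
    """(1) Extraction Initiator: predicates the KG lacks for this entity."""
    return [p for p in sorted(ONTOLOGY[entity_type]) if (entity, p) not in kg]

def retrieve(corpus, entity):
    """(2) Evidence Retriever: documents that mention the entity."""
    return {doc_id: text for doc_id, text in corpus.items() if entity in text}

def pattern_extract(entity, docs):
    """(3a) Pattern-based extractor: one illustrative regex rule."""
    facts = []
    for doc_id, text in docs.items():
        for m in re.finditer(r"born on (\d{4}-\d{2}-\d{2})", text):
            facts.append(Fact(entity, "date_of_birth", m.group(1), doc_id))
    return facts

def llm_extract(entity, entity_type, docs, llm):
    """(3b) Ontology-guided extractor: `llm` maps a prompt to (predicate, value) pairs."""
    snippet = ontology_snippet(entity_type)
    facts = []
    for doc_id, text in docs.items():
        prompt = f"{snippet}\nEntity: {entity}\nText: {text}"
        for predicate, value in llm(prompt):
            if predicate in ONTOLOGY[entity_type]:  # drop out-of-schema output
                facts.append(Fact(entity, predicate, value, doc_id))
    return facts

def ground(fact, docs):
    """(4) Grounder stand-in: the value must literally appear in its source."""
    return fact.value.lower() in docs[fact.source].lower()

def corroborate(facts):
    """(5) Corroborator: normalize values, rank by number of distinct sources."""
    support = defaultdict(set)
    for f in facts:
        support[(f.predicate, f.value.strip().lower())].add(f.source)
    best = {}
    for (predicate, value), sources in sorted(support.items()):
        if predicate not in best or len(sources) > best[predicate][1]:
            best[predicate] = (value, len(sources))
    return best

def run_pipeline(kg, corpus, entity, entity_type, llm):
    missing = initiate(kg, entity, entity_type)
    if not missing:
        return {}
    docs = retrieve(corpus, entity)
    candidates = pattern_extract(entity, docs) + llm_extract(entity, entity_type, docs, llm)
    grounded = [f for f in candidates if f.predicate in missing and ground(f, docs)]
    return corroborate(grounded)

# Toy data: the stub LLM always returns the same pairs, so the Grounder has
# to reject them for documents that do not support the value.
corpus = {
    "d1": "Ada Lovelace was born on 1815-12-10 and worked as a mathematician.",
    "d2": "Ada Lovelace, born on 1815-12-10, wrote about the Analytical Engine.",
    "d3": "Ada Lovelace was born on 1815-12-11 according to one page.",
    "d4": "Charles Babbage designed the Analytical Engine.",
}

def fake_llm(prompt):
    return [("occupation", "Mathematician"), ("spouse", "William King")]

result = run_pipeline(set(), corpus, "Ada Lovelace", "Person", fake_llm)
print(result)
# {'date_of_birth': ('1815-12-10', 2), 'occupation': ('mathematician', 1)}
```

In this sketch the out-of-schema `spouse` pair is dropped by the ontology check, the `occupation` value is rejected for the two documents that never mention it, and the conflicting birth date loses to the value supported by two sources. A real deployment would replace the substring check and the source-count ranking with the verification and scoring the system actually uses.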
