Evaluating Language Models for Knowledge Base Completion

Structured knowledge bases (KBs) are a foundation of many intelligent applications, yet they are notoriously incomplete. Language models (LMs) have recently been proposed for unsupervised knowledge base completion (KBC), but despite encouraging initial results, their suitability remains an open question. Existing evaluations often fall short because they cover only popular subjects or sample facts that already exist in KBs. In this work, we introduce a novel, more challenging benchmark dataset and a methodology tailored to a realistic assessment of the KBC potential of LMs. For automated assessment, we curate a dataset called WD-KNOWN, an unbiased random sample of Wikidata containing over 3.9 million facts. In a second step, we perform a human evaluation of predictions that are not yet in the KB, since only these show the value added over existing KBs. Our key finding is that biases in the design of previous benchmark datasets lead to a systematic overestimate of LM performance for KBC. However, our results also reveal areas where LMs are strong. For example, we could substantially extend Wikidata on three relations, where each factor is the number of added facts relative to the number already present: nativeLanguage by a factor of ~21 (from 260k to 5.8M) at 82% precision, usedLanguage by a factor of ~2.1 (from 2.1M to 6.6M) at 82% precision, and citizenOf by a factor of ~0.3 (from 4.2M to 5.3M) at 90% precision. Moreover, we find that LMs possess surprisingly strong generalization capabilities: even on relations where most facts were not directly observed in LM training, prediction quality can be high.
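
The completion factors quoted above match the ratio of newly added facts to facts already present, rather than the ratio of final to initial size. The following minimal sketch checks that reading against the rounded counts given in the abstract. The function name `relative_growth` and the dictionary layout are illustrative assumptions, not part of the original work.

```python
def relative_growth(before: float, after: float) -> float:
    """Number of added facts divided by the number of facts already present."""
    return (after - before) / before


# Rounded fact counts (before, after) as quoted in the abstract.
reported = {
    "nativeLanguage": (260_000, 5_800_000),
    "usedLanguage": (2_100_000, 6_600_000),
    "citizenOf": (4_200_000, 5_300_000),
}

growth = {rel: relative_growth(b, a) for rel, (b, a) in reported.items()}

for rel, g in growth.items():
    print(f"{rel}: {g:.2f}")
# nativeLanguage: 21.31
# usedLanguage: 2.14
# citizenOf: 0.26
```

Under this reading, a factor of ~0.3 for citizenOf means the relation grows by about a quarter of its current size, not that it shrinks.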
