Automatically extracting information is essential for populating large web knowledge bases such as Wikidata. The temporal variant of this task, temporal knowledge graph extraction (TKGE), extracts temporally grounded facts from text, represented as semantic quadruples (subject, relation, object, timestamp). Many recent systems build on large language models (LLMs), whose strong performance across natural language processing (NLP) tasks has made them a new cornerstone of the web. Despite the importance of TKGE, datasets for training and evaluation remain scarce, and contamination of evaluation data is an unaddressed issue: overlaps between training and evaluation sets can inflate LLMs' measured performance. To mitigate these challenges, we propose a novel synthetic evaluation dataset constructed from predicted, previously unseen future temporal facts, thereby eliminating contamination and enabling robust, unbiased benchmarking. Our dataset creation follows a two-step approach: (1) temporal knowledge graph forecasting (TKGF) generates plausible future quadruples, which are then filtered to conform to the original knowledge base schema; (2) LLMs perform quadruple-to-text generation, producing semantically aligned textual descriptions. We benchmark Extract, Define and Canonicalize (EDC), a state-of-the-art LLM-based extraction framework, and show that LLM performance drops when evaluated on our dataset compared to a dataset of known facts. We publicly release our dataset of 4.2K future quadruples with corresponding textual descriptions, along with the generation methodology, enabling continuous creation of unlimited future temporal datasets to serve as long-term, contamination-free benchmarks for TKGE.
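The quadruple representation and the schema-conformance filter from step (1) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the relation names, entity types, and toy schema below are invented for demonstration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Quadruple:
    """A temporally grounded fact: (subject, relation, object, timestamp)."""
    subject: str
    relation: str
    obj: str
    timestamp: str  # e.g. an ISO date for a predicted future fact

# Hypothetical schema: each relation admits one (subject type, object type) pair.
SCHEMA = {
    "member_of": ("Person", "Organization"),
    "located_in": ("Organization", "Place"),
}

# Hypothetical entity-type assignments from the underlying knowledge base.
ENTITY_TYPES = {
    "Alice": "Person",
    "Acme Corp": "Organization",
    "Berlin": "Place",
}

def conforms_to_schema(q: Quadruple) -> bool:
    """Keep only forecast quadruples whose subject/object types match the schema."""
    expected = SCHEMA.get(q.relation)
    if expected is None:
        return False
    return (ENTITY_TYPES.get(q.subject) == expected[0]
            and ENTITY_TYPES.get(q.obj) == expected[1])

# A TKGF model may emit type-invalid candidates; filtering discards them.
forecasts = [
    Quadruple("Alice", "member_of", "Acme Corp", "2026-05-01"),
    Quadruple("Berlin", "member_of", "Alice", "2026-05-01"),  # type-invalid
]
kept = [q for q in forecasts if conforms_to_schema(q)]
print(len(kept))  # 1
```

The filtered quadruples would then be passed to an LLM for quadruple-to-text generation in step (2).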