Robustness to distribution shift determines whether NLP models can be applied successfully in the real world, especially for information extraction tasks. However, most prior evaluation benchmarks have been devoted to validating pairwise matching correctness and ignore the crucial measurement of robustness. In this paper, we present the first benchmark that simulates the evaluation of open information extraction models in the real world, where the syntactic and expressive distributions of sentences carrying the same knowledge may drift in various ways. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique: a set of sentences whose structured knowledge has the same meaning but whose syntactic and expressive forms differ. We further elaborate a robustness metric under which a model is judged to be robust if its performance is consistently accurate across all sentences of each clique. We perform experiments on typical models published in the last decade as well as a popular large language model. The results show that existing successful models degrade markedly, with a maximum drop of 23.43 F1 points. Our resources and code will be publicly available.
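To make the idea of clique-level robustness concrete, the sketch below scores a model on knowledge-invariant cliques in two ways: the usual average F1 over sentences, and a worst-case F1 per clique. The abstract does not specify how the metric aggregates within a clique, so the worst-case aggregation, the exact-match comparison of triples, and the names `f1` and `clique_scores` are illustrative assumptions rather than the benchmark's actual definition.

```python
from statistics import mean


def f1(pred, gold):
    """Exact-match F1 between predicted and gold triples (assumed matching rule)."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def clique_scores(cliques):
    """Return (average F1, worst-case F1), each averaged over cliques.

    `cliques` is a list of cliques; each clique is a list of
    (predicted_triples, gold_triples) pairs, one pair per paraphrase.
    Worst-case aggregation is an assumption, not the paper's stated metric.
    """
    average = mean(mean(f1(p, g) for p, g in clique) for clique in cliques)
    worst = mean(min(f1(p, g) for p, g in clique) for clique in cliques)
    return average, worst


# Hypothetical data: the model handles one paraphrase but misses the other.
t = ("Marie Curie", "won", "Nobel Prize")
cliques = [
    [([t], [t]), ([], [t])],   # correct on one paraphrase, fails on the other
    [([t], [t]), ([t], [t])],  # consistent across the clique
]
average_f1, robust_f1 = clique_scores(cliques)
```

Under this reading, a model that is accurate on only some paraphrases of a clique keeps a reasonable average F1 but is penalized on the worst-case score, which is the kind of gap a robustness benchmark is meant to expose.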