Robustness to distribution changes ensures that NLP models can be successfully applied in the real world, especially for information extraction tasks. However, most prior evaluation benchmarks focus on validating pairwise matching correctness and ignore the crucial measurement of robustness. In this paper, we present the first benchmark that simulates the evaluation of open information extraction models in the real world, where the syntactic and expressive distributions of sentences conveying the same knowledge may drift in various ways. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique: a set of sentences that express structured knowledge of the same meaning in different syntactic and expressive forms. We further elaborate a robustness metric under which a model is judged to be robust if its performance is consistently accurate across the overall cliques. We perform experiments on typical models published in the last decade as well as a popular large language model; the results show that existing successful models exhibit a frustrating degradation, with a maximum drop of 23.43 F1 points. Our resources and code are available at https://github.com/qijimrc/ROBUST.
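To make the idea of clique-level robustness concrete, the following is a minimal sketch and not the benchmark's actual scorer. It assumes, purely for illustration, that extractions are compared by exact match on (argument, relation, argument) triples and that a clique's score is the worst per-sentence F1 within it; the function names `f1`, `average_f1`, and `robust_f1` and the toy data are hypothetical, and every clique is assumed to be non-empty.

```python
from statistics import mean


def f1(pred, gold):
    """Exact-match F1 between two sets of (arg1, relation, arg2) triples."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def clique_scores(clique):
    """Per-sentence F1 for one clique, given as a list of (predicted, gold) pairs."""
    return [f1(pred, gold) for pred, gold in clique]


def average_f1(cliques):
    """Conventional score: mean F1 over all sentences, ignoring clique structure."""
    return mean(s for clique in cliques for s in clique_scores(clique))


def robust_f1(cliques):
    """Assumed robustness score: mean over cliques of the worst sentence F1."""
    return mean(min(clique_scores(clique)) for clique in cliques)


# Toy data: one gold triple expressed by several paraphrases per clique.
t = ("Marie Curie", "won", "Nobel Prize")
cliques = [
    [({t}, {t}), ({t}, {t}), (set(), {t})],  # the third paraphrase is missed
    [({t}, {t}), ({t}, {t})],
]

print(average_f1(cliques))  # sentence-level average
print(robust_f1(cliques))   # clique-level worst case
```

On this toy input the sentence-level average stays high because four of the five sentences are extracted correctly, while the clique-level score is pulled down by the single missed paraphrase. This gap is the kind of degradation a robustness-oriented benchmark is meant to expose.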