The Devil is in the Details: On the Pitfalls of Event Extraction Evaluation

Event extraction (EE) is a crucial task aiming at extracting events from texts, which includes two subtasks: event detection (ED) and event argument extraction (EAE). In this paper, we check the reliability of EE evaluations and identify three major pitfalls: (1) The data preprocessing discrepancy makes the evaluation results on the same dataset not directly comparable, but the data preprocessing details are not widely noted and specified in papers. (2) The output space discrepancy of different model paradigms makes different-paradigm EE models lack grounds for comparison and also leads to unclear mapping issues between predictions and annotations. (3) The absence of pipeline evaluation of many EAE-only works makes them hard to be directly compared with EE works and may not well reflect the model performance in real-world pipeline scenarios. We demonstrate the significant influence of these pitfalls through comprehensive meta-analyses of recent papers and empirical experiments. To avoid these pitfalls, we suggest a series of remedies, including specifying data preprocessing, standardizing outputs, and providing pipeline evaluation results. To help implement these remedies, we develop a consistent evaluation framework OMNIEVENT, which can be obtained from https://github.com/THU-KEG/OmniEvent.

翻译：事件抽取（EE）是一项从文本中抽取事件的关键任务，包括两个子任务：事件检测（ED）和事件论元抽取（EAE）。本文检验了EE评估的可靠性，并识别出三大主要陷阱：（1）数据预处理差异使得同一数据集上的评估结果不可直接比较，但数据预处理细节在论文中未被广泛标注和说明；（2）不同模型范式的输出空间差异导致不同范式的EE模型缺乏比较基础，并引发预测与标注之间的映射关系不明确问题；（3）许多仅评估EAE的工作缺乏流水线评估，使其难以与EE工作直接比较，且可能无法真实反映模型在实际流水线场景中的性能。我们通过综合元分析和实证实验，展示了这些陷阱的显著影响。为避免这些陷阱，我们提出了一系列补救措施，包括明确数据预处理、标准化输出以及提供流水线评估结果。为实施这些措施，我们开发了统一评估框架OMNIEVENT，可从https://github.com/THU-KEG/OmniEvent获取。

相关内容

数据预处理

关注 1176

数据预处理（data preprocessing）是指在主要的处理以前对数据进行的一些处理。如对大部分地球物理面积性观测数据在进行转换或增强处理之前，首先将不规则分布的测网经过插值转换为规则网的处理，以利于计算机的运算。另外，对于一些剖面测量数据，如地震资料预处理有垂直叠加、重排、加道头、编辑、重新取样、多路编辑等。

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日