The Devil is in the Details: On the Pitfalls of Event Extraction Evaluation

Event extraction (EE) is a crucial task that aims to extract events from text and comprises two subtasks: event detection (ED) and event argument extraction (EAE). In this paper, we examine the reliability of EE evaluations and identify three major pitfalls: (1) Data preprocessing discrepancies make evaluation results on the same dataset not directly comparable, yet preprocessing details are rarely noted or specified in papers. (2) Output space discrepancies between model paradigms leave EE models of different paradigms without common grounds for comparison and also lead to unclear mappings between predictions and annotations. (3) Many EAE-only works omit pipeline evaluation, which makes them hard to compare directly with EE works and may fail to reflect model performance in real-world pipeline scenarios. We demonstrate the significant influence of these pitfalls through comprehensive meta-analyses of recent papers and empirical experiments. To avoid these pitfalls, we suggest a series of remedies, including specifying data preprocessing, standardizing outputs, and providing pipeline evaluation results. To help implement these remedies, we develop a consistent evaluation framework, OMNIEVENT, which can be obtained from https://github.com/THU-KEG/OmniEvent.
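To make pitfall (3) concrete, here is a minimal sketch of how an EAE score can differ between gold-trigger evaluation and pipeline evaluation. It is not OmniEvent's API: the toy lookup table `TOY_EAE`, the helper names `run_eae` and `micro_f1`, the spans, and the event and role labels are all hypothetical, chosen only for illustration.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int]
Trigger = Tuple[str, Span]            # (event_type, trigger_span)
Argument = Tuple[str, Span, Span, str]  # (event_type, trigger_span, arg_span, role)

# Hypothetical stand-in for an EAE model: maps a trigger to the
# (argument_span, role) pairs it would predict for that trigger.
TOY_EAE: Dict[Trigger, Set[Tuple[Span, str]]] = {
    ("Attack", (3, 4)): {((0, 1), "Attacker"), ((5, 6), "Target")},
    ("Die", (7, 8)): {((5, 6), "Victim")},
    ("Transport", (7, 8)): {((5, 6), "Artifact")},
}

def run_eae(triggers: Set[Trigger]) -> Set[Argument]:
    """Collect the toy model's argument predictions for the given triggers."""
    predictions: Set[Argument] = set()
    for event_type, trigger_span in triggers:
        for arg_span, role in TOY_EAE.get((event_type, trigger_span), set()):
            predictions.add((event_type, trigger_span, arg_span, role))
    return predictions

def micro_f1(gold: Set[Argument], pred: Set[Argument]) -> float:
    """Micro-averaged F1 over exact-match argument tuples."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Gold annotations for one toy sentence.
gold_triggers: Set[Trigger] = {("Attack", (3, 4)), ("Die", (7, 8))}
gold_args: Set[Argument] = {
    ("Attack", (3, 4), (0, 1), "Attacker"),
    ("Attack", (3, 4), (5, 6), "Target"),
    ("Die", (7, 8), (5, 6), "Victim"),
}

# Hypothetical ED output: the second trigger is given the wrong event type.
predicted_triggers: Set[Trigger] = {("Attack", (3, 4)), ("Transport", (7, 8))}

# EAE-only setting: arguments are extracted for gold triggers.
gold_trigger_f1 = micro_f1(gold_args, run_eae(gold_triggers))
# Pipeline setting: arguments are extracted for predicted triggers,
# so ED errors propagate into the EAE score.
pipeline_f1 = micro_f1(gold_args, run_eae(predicted_triggers))

print(f"gold-trigger EAE F1: {gold_trigger_f1:.3f}")
print(f"pipeline EAE F1:     {pipeline_f1:.3f}")
```

In this toy case the same argument extractor scores 1.000 with gold triggers and 0.667 in the pipeline setting, because the single misclassified trigger yields one spurious argument and one missed argument. The exact-match criterion used here is itself an assumption; other matching criteria would give different numbers.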

