Event extraction (EE) is a crucial task that aims to extract events from text and comprises two subtasks: event detection (ED) and event argument extraction (EAE). In this paper, we examine the reliability of EE evaluations and identify three major pitfalls: (1) Data preprocessing discrepancies make evaluation results on the same dataset not directly comparable, yet preprocessing details are rarely noted or specified in papers. (2) Output space discrepancies between model paradigms leave EE models of different paradigms without common grounds for comparison and also lead to unclear mappings between predictions and annotations. (3) Many EAE-only works do not report pipeline evaluation, which makes them hard to compare directly with EE works and means their results may not reflect model performance in real-world pipeline scenarios. We demonstrate the significant influence of these pitfalls through comprehensive meta-analyses of recent papers and empirical experiments. To avoid these pitfalls, we suggest a series of remedies, including specifying data preprocessing, standardizing outputs, and providing pipeline evaluation results. To help implement these remedies, we develop a consistent evaluation framework, OMNIEVENT, which can be obtained from https://github.com/THU-KEG/OmniEvent.
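To make the third pitfall concrete, the following minimal sketch contrasts EAE scores under a gold-trigger setting with a pipeline setting. It assumes that an argument counts as correct only when its event type, trigger position, role, and span all match the annotation. The tuple layout, the `micro_f1` helper, and the toy data are hypothetical illustrations, not the OMNIEVENT API or the exact metric definition used in the paper.

```python
from typing import Set, Tuple

# Hypothetical argument record: (event_type, trigger_offset, role, argument_span)
Arg = Tuple[str, int, str, Tuple[int, int]]


def micro_f1(pred: Set[Arg], gold: Set[Arg]) -> float:
    """Micro-averaged F1 over exact-match argument records."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy annotations: two events, three arguments in total.
gold_args: Set[Arg] = {
    ("Attack", 3, "Attacker", (0, 1)),
    ("Attack", 3, "Target", (5, 6)),
    ("Transport", 9, "Destination", (11, 12)),
}

# Gold-trigger setting: the EAE model is handed every annotated trigger
# and, in this toy case, predicts all arguments correctly.
pred_with_gold_triggers: Set[Arg] = set(gold_args)

# Pipeline setting: the ED stage missed the trigger at offset 9, so the
# same EAE model never gets the chance to predict that event's arguments.
pred_in_pipeline: Set[Arg] = {a for a in gold_args if a[1] != 9}

gold_trigger_f1 = micro_f1(pred_with_gold_triggers, gold_args)
pipeline_f1 = micro_f1(pred_in_pipeline, gold_args)

print(f"gold-trigger F1: {gold_trigger_f1:.2f}")
print(f"pipeline F1:     {pipeline_f1:.2f}")
```

In this toy case the same argument extractor scores lower once detection errors propagate, which is why a number reported only under the gold-trigger setting is hard to compare with results from full EE systems.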