Event extraction (EE) is a crucial task that aims to extract events from text and comprises two subtasks: event detection (ED) and event argument extraction (EAE). In this paper, we examine the reliability of EE evaluations and identify three major pitfalls: (1) Data preprocessing discrepancies make evaluation results on the same dataset not directly comparable, yet preprocessing details are not widely noted or specified in papers. (2) Output space discrepancies between model paradigms leave EE models of different paradigms without grounds for comparison and also lead to unclear mappings between predictions and annotations. (3) The absence of pipeline evaluation in many EAE-only works makes them hard to compare directly with EE works, and their results may not reflect model performance in real-world pipeline scenarios. We demonstrate the significant influence of these pitfalls through comprehensive meta-analyses of recent papers and empirical experiments. To avoid these pitfalls, we suggest a series of remedies, including specifying data preprocessing, standardizing outputs, and providing pipeline evaluation results. To help implement these remedies, we develop a consistent evaluation framework, OMNIEVENT, which is available at https://github.com/THU-KEG/OmniEvent.