Event extraction has gained extensive research attention due to its broad range of applications. However, the current mainstream evaluation method for event extraction relies on token-level exact match, which misjudges numerous semantically correct cases. This reliance leads to a significant discrepancy between the performance of models as measured under exact-match criteria and their real performance. To address this problem, we propose a reliable semantic evaluation framework for event extraction, named RAEE, which accurately assesses extraction results at the semantic level instead of the token level. Specifically, RAEE leverages large language models (LLMs) as evaluation agents, incorporating an adaptive mechanism to evaluate the precision and recall of triggers and arguments. Extensive experiments demonstrate that: (1) RAEE achieves a very strong correlation with human judgments; (2) after reassessing 14 models, including advanced LLMs, on 10 datasets, a significant performance gap emerges between exact match and RAEE. Exact-match evaluation significantly underestimates the performance of existing event extraction models, and in particular underestimates the capabilities of LLMs; (3) fine-grained analysis under RAEE reveals insightful phenomena worth further exploration. The evaluation toolkit of our proposed RAEE is publicly released.
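To make the contrast between token-level exact match and semantic-level evaluation concrete, the following is a minimal sketch (not the RAEE implementation): precision and recall are computed over predicted versus gold mentions using a pluggable `match` judge. Here a simple case-insensitive containment check stands in for the LLM evaluation agent; all function and variable names are illustrative assumptions, not part of the released toolkit.

```python
def semantic_prf(pred, gold, match):
    """Compute precision/recall/F1 where `match` decides mention equivalence.

    `match` can be an exact-match check or any semantic judge
    (e.g. an LLM agent); each gold mention is matched at most once.
    """
    matched_gold = set()
    tp = 0
    for p in pred:
        for i, g in enumerate(gold):
            if i not in matched_gold and match(p, g):
                matched_gold.add(i)
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Stub "semantic" judge: case-insensitive containment; a real system
# would query an LLM to decide semantic equivalence.
def loose_match(p, g):
    p, g = p.lower(), g.lower()
    return p in g or g in p


# "fired" vs gold "was fired" fails exact match but passes the loose judge.
pred_triggers = ["fired", "acquisition"]
gold_triggers = ["was fired", "merger"]
p, r, f = semantic_prf(pred_triggers, gold_triggers, loose_match)
```

Swapping `loose_match` for strict string equality reproduces exact-match scoring, which is what makes the gap between the two criteria directly measurable.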