GENEVA: Benchmarking Generalizability for Event Argument Extraction with Hundreds of Event Types and Argument Roles

Recent work in Event Argument Extraction (EAE) has focused on improving model generalizability to new events and domains. However, standard benchmarking datasets like ACE and ERE cover fewer than 40 event types and 25 entity-centric argument roles. This limited diversity and coverage prevents these datasets from adequately evaluating the generalizability of EAE models. In this paper, we first contribute a large and diverse EAE ontology. We create this ontology by adapting FrameNet, a comprehensive semantic role labeling (SRL) dataset, to EAE, exploiting the similarity between the two tasks. We then collect exhaustive human expert annotations to build the ontology, which comprises 115 events and 220 argument roles, a significant portion of which are not entities. We use this ontology to introduce GENEVA, a diverse generalizability benchmarking dataset comprising four test suites that evaluate models' ability to handle limited data and to generalize to unseen event types. We benchmark six EAE models from various families. The results show that, owing to non-entity argument roles, even the best-performing model achieves only a 39% F1 score, indicating that GENEVA poses new challenges for generalization in EAE. Overall, our large and diverse EAE ontology can aid in creating more comprehensive future resources, while GENEVA is a challenging benchmarking dataset that encourages further research on improving generalizability in EAE. The code and data can be found at https://github.com/PlusLabNLP/GENEVA.
