EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents

Event extraction identifies the central aspects of events from text. It supports event understanding and analysis, which is crucial for tasks such as informed decision-making in emergencies. Therefore, it is necessary to develop automated event extraction approaches. However, existing datasets for algorithm development have limitations, including limited coverage of event types in closed-domain settings and a lack of large, manually verified datasets in open-domain settings. To address these limitations, we create EVENT5Ws, a large, manually annotated, and statistically verified open-domain event extraction dataset. We design a systematic annotation pipeline to create the dataset and provide empirical insights into annotation complexity. Using EVENT5Ws, we evaluate state-of-the-art pre-trained large language models and establish a benchmark for future research. We further show that models trained on EVENT5Ws generalize effectively to datasets from different geographical contexts, which demonstrates its potential for developing generalizable algorithms. Finally, we summarize the lessons learned during dataset development and provide recommendations to support future large-scale dataset development.

