Document-Level Event Argument Extraction (DocEAE) is an extremely difficult information extraction problem, and performance is significantly limited in low-resource, cross-domain settings. To address this problem, we introduce Mad Lib Aug (MLA), a novel generative data augmentation framework for DocEAE. Our approach leverages the intuition that Mad Libs, the categorically masked documents used in a popular word game, can be both generated and solved by LLMs to produce DocEAE training data. Using MLA, we achieve a 2.6-point average improvement in overall F1 score. Moreover, across all experiments, MLA yields average gains of 3.9 and 5.2 points on zero-shot and few-shot event roles, respectively, compared to augmentation-free baselines. To better facilitate analysis of cross-domain DocEAE, we additionally introduce a new metric, Role-Depth F1 (RDF1), which uses statistical depth to identify roles in the target domain that are semantic outliers with respect to roles observed in the source domain. Our experiments show that MLA augmentation boosts RDF1 by an average of 5.85 points compared to training on non-augmented datasets.
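To make the notion of a "categorically masked document" concrete, the following is a minimal illustrative sketch, not the MLA pipeline itself. It assumes argument annotations are given as character-offset spans with role labels; the span format, the role names, and the `mask_arguments` helper are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical annotation format (an assumption, not from the paper):
# (start, end, role) with character offsets, end exclusive.
Span = Tuple[int, int, str]


def mask_arguments(text: str, spans: List[Span]) -> str:
    """Replace each annotated argument span with a role-typed blank."""
    pieces = []
    cursor = 0
    for start, end, role in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(text[cursor:start])   # untouched text before the span
        pieces.append(f"[{role.upper()}]")  # categorical blank for the role
        cursor = end
    pieces.append(text[cursor:])            # trailing text after the last span
    return "".join(pieces)


doc = "Police arrested two suspects in Boston on Friday."
spans = [(0, 6, "jailer"), (16, 28, "detainee"), (32, 38, "place")]
masked = mask_arguments(doc, spans)
print(masked)
```

In this toy setting, a template such as `masked` is the kind of object an LLM would be asked to fill in, with each filled blank inheriting the role label of its placeholder.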