Piecing Together Cross-Document Coreference Resolution Datasets: Systematic Dataset Analysis and Unification

Research in cross-document coreference resolution (CDCR) remains fragmented due to heterogeneous dataset formats, varying annotation standards, and the prevailing tendency to define CDCR narrowly as event coreference resolution (ECR). To address these challenges, we introduce uCDCR, a unified dataset that consolidates diverse publicly available English CDCR corpora across various domains into a consistent format, which we analyze with standardized metrics and evaluation protocols. uCDCR incorporates both entity and event coreference, corrects known inconsistencies, and enriches datasets with missing attributes to facilitate reproducible research. We establish a cohesive framework for fair, interpretable, cross-dataset analysis in CDCR. Within this framework, we compare the datasets on their lexical properties (the lexical composition of the annotated mentions, lexical diversity, and ambiguity metrics), discuss the annotation rules and principles that lead to high lexical diversity, and examine how these metrics influence performance on the same-head-lemma baseline. Our dataset analysis shows that ECB+, the state-of-the-art benchmark for CDCR, has one of the lowest lexical diversities, and that its CDCR complexity, measured by the same-head-lemma baseline, lies in the middle among all uCDCR datasets. Moreover, a comparison of document and mention distributions between ECB+ and uCDCR shows that using all uCDCR datasets for model training and evaluation will improve the generalizability of CDCR models. Finally, the almost identical performance of the same-head-lemma baseline when applied separately to events and entities shows that resolving both types is a complex task and that CDCR should not be steered toward ECR alone. The uCDCR dataset is available at https://huggingface.co/datasets/AnZhu/uCDCR, and the code for parsing, analyzing, and scoring the dataset is available at https://github.com/anastasia-zhukova/uCDCR.
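To make the same-head-lemma baseline mentioned above more concrete, the following is a minimal sketch of the general idea: mentions whose syntactic heads share a lemma are placed in one coreference cluster. It is an illustration under stated assumptions, not the uCDCR implementation. The field names `mention_id` and `head_lemma` are hypothetical, head lemmas are assumed to be precomputed, and the actual scoring code may add constraints (for example, treating events and entities separately or restricting clusters to a topic).

```python
from collections import defaultdict


def same_head_lemma_clusters(mentions):
    """Group mention ids whose head lemmas match, ignoring case.

    `mentions` is a list of dicts with the hypothetical keys
    "mention_id" and "head_lemma" (assumed to be precomputed).
    """
    clusters = defaultdict(list)
    for mention in mentions:
        key = mention["head_lemma"].lower()
        clusters[key].append(mention["mention_id"])
    # One predicted coreference chain per distinct head lemma,
    # in order of first appearance.
    return list(clusters.values())


# Toy mentions from three hypothetical documents (illustrative data only).
mentions = [
    {"mention_id": "d1_m1", "head_lemma": "attack"},
    {"mention_id": "d2_m4", "head_lemma": "Attack"},
    {"mention_id": "d2_m7", "head_lemma": "strike"},
    {"mention_id": "d3_m2", "head_lemma": "president"},
]

predicted = same_head_lemma_clusters(mentions)
print(predicted)
```

In this toy input, "attack" and "strike" land in different clusters even if they refer to the same event. That is the kind of error a lexically diverse dataset induces in this baseline, which is why its score can serve as a rough indicator of dataset difficulty.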
