Coreference resolution models are often evaluated on multiple datasets. Datasets vary, however, in how coreference is realized -- i.e., how the theoretical concept of coreference is operationalized in the dataset -- due to factors such as the choice of corpora and annotation guidelines. We investigate the extent to which errors of current coreference resolution models are associated with differences in operationalization across datasets (OntoNotes, PreCo, and Winogrande). Specifically, we break down model performance into categories corresponding to several types of coreference, including coreferring generic mentions, compound modifiers, and copula predicates. This breakdown helps us investigate how state-of-the-art models might vary in their ability to generalize across different coreference types. In our experiments, for example, models trained on OntoNotes perform poorly on generic mentions and copula predicates in PreCo. Our findings help calibrate expectations of current coreference resolution models, and future work can explicitly account for those types of coreference that are empirically associated with poor generalization when developing models.