It is increasingly common to evaluate the same coreference resolution (CR) model on multiple datasets. Do these multi-dataset evaluations allow us to draw meaningful conclusions about model generalization? Or, do they rather reflect the idiosyncrasies of a particular experimental setup (e.g., the specific datasets used)? To study this, we view evaluation through the lens of measurement modeling, a framework commonly used in the social sciences for analyzing the validity of measurements. By taking this perspective, we show how multi-dataset evaluations risk conflating different factors concerning what, precisely, is being measured. This in turn makes it difficult to draw more generalizable conclusions from these evaluations. For instance, we show that across seven datasets, measurements intended to reflect CR model generalization are often correlated with differences in both how coreference is defined and how it is operationalized; this limits our ability to draw conclusions regarding the ability of CR models to generalize across any singular dimension. We believe the measurement modeling framework provides the needed vocabulary for discussing challenges surrounding what is actually being measured by CR evaluations.