All state-of-the-art coreference resolution (CR) models involve finetuning a pretrained language model. Whether the superior performance of one CR model over another is due to the choice of language model or to other factors, such as the task-specific architecture, is difficult or impossible to determine because there is no standardized experimental setup. To resolve this ambiguity, we systematically evaluate five CR models while controlling for certain design decisions, including the pretrained language model used by each. When controlling for language model size, encoder-based CR models outperform more recent decoder-based models in both accuracy and inference speed. Surprisingly, among encoder-based CR models, more recent models are not always more accurate, and the oldest CR model that we test generalizes best to out-of-domain textual genres. We conclude that controlling for the choice of language model eliminates most, but not all, of the increase in F1 score reported in the past five years.