Metrics are the foundation of automatic evaluation in grammatical error correction (GEC), and the evaluation of the metrics themselves (meta-evaluation) relies on their correlation with human judgments. However, conventional meta-evaluations in English GEC face several challenges, including biases caused by inconsistencies in evaluation granularity and an outdated setup that uses classical systems. These problems can lead to misinterpretation of metrics and may hinder the applicability of GEC techniques. To address these issues, this paper proposes SEEDA, a new dataset for GEC meta-evaluation. SEEDA consists of corrections with human ratings at two granularities, edit-based and sentence-based, covering 12 state-of-the-art systems, including large language models (LLMs), and two human corrections with different focuses. Aligning the granularity in sentence-level meta-evaluation improves correlations, which suggests that edit-based metrics may have been underestimated in existing studies. Furthermore, the correlations of most metrics decrease when moving from classical to neural systems, indicating that traditional metrics are relatively poor at evaluating fluently corrected sentences with many edits.
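As a rough illustration of what meta-evaluation by correlation with human judgments involves, the sketch below computes Pearson and Spearman correlations between per-system human scores and per-system metric scores. It is a minimal sketch, not SEEDA's actual procedure: the function names and all score values are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical per-system scores (made-up numbers, one entry per GEC system).
human_scores = [0.10, 0.35, 0.50, 0.80]
metric_scores = [0.20, 0.30, 0.65, 0.60]

if __name__ == "__main__":
    print("Pearson :", pearson(human_scores, metric_scores))
    print("Spearman:", spearman(human_scores, metric_scores))
```

A metric whose scores track the human scores closely yields correlations near 1. Which human ratings are used for comparison (edit-based or sentence-based) is exactly the granularity choice the abstract refers to.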