Evaluating the performance of Grammatical Error Correction (GEC) systems is challenging because GEC is a highly subjective task. Designing an evaluation metric that is as objective as possible is therefore crucial to the development of the GEC task. Previous mainstream evaluation metrics, i.e., reference-based metrics, introduce bias into multi-reference evaluation because they extract edits without considering the presence of multiple references. To overcome this problem, we propose Chunk-LEvel Multi-reference Evaluation (CLEME), which is designed to evaluate GEC systems in multi-reference settings. First, CLEME builds chunk sequences with consistent boundaries for the source, the hypothesis, and all the references, thus eliminating the bias caused by inconsistent edit boundaries. Then, based on the finding that boundaries exist between different grammatical errors, we automatically determine the grammatical error boundaries and compute F$_{0.5}$ scores in a novel way. The proposed CLEME approach consistently and substantially outperforms existing reference-based GEC metrics on multiple reference sets in both corpus-level and sentence-level settings. Extensive experiments and detailed analyses support the correctness of this finding and the effectiveness of the proposed evaluation metric.
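For readers unfamiliar with the F$_{0.5}$ score mentioned above, the following minimal sketch shows the standard F$_\beta$ computation from true-positive, false-positive, and false-negative counts. It is not CLEME's chunk-level scoring procedure, which the abstract does not specify. The function name `f_beta` and the convention of returning 1.0 for an undefined precision or recall are assumptions made for illustration.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Standard F_beta score from edit counts (illustrative, not CLEME itself).

    With beta = 0.5, precision is weighted more heavily than recall,
    which is the usual choice in GEC evaluation.
    """
    # Assumed convention: an undefined ratio (zero denominator) counts as 1.0.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical counts: 3 correct edits, 1 spurious edit, 3 missed edits.
    print(f"F0.5 = {f_beta(tp=3, fp=1, fn=3):.4f}")
```

In this hypothetical case precision is 0.75 and recall is 0.5, so the score lies closer to the precision than the balanced F$_1$ would.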