Evaluating the performance of Grammatical Error Correction (GEC) systems is challenging because the task is inherently subjective. Designing an evaluation metric that is as objective as possible is therefore crucial to the development of the GEC task. However, mainstream evaluation metrics, i.e., reference-based metrics, introduce bias into multi-reference evaluation because they extract edits without considering the presence of multiple references. To overcome this issue, we propose Chunk-LEvel Multi-reference Evaluation (CLEME), which is designed to evaluate GEC systems in the multi-reference setting. CLEME builds chunk sequences with consistent boundaries for the source, the hypothesis, and the references, thus eliminating the bias caused by inconsistent edit boundaries. Furthermore, we observe that these consistent boundaries can also act as the boundaries of grammatical errors; based on this observation, the F$_{0.5}$ score is computed under the correction independence assumption. We conduct experiments on six English reference sets based on the CoNLL-2014 shared task. Extensive experiments and detailed analyses demonstrate the correctness of our observation and the effectiveness of CLEME. Further analysis reveals that CLEME evaluates GEC systems robustly across reference sets with varying numbers of references and annotation styles.
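For readers unfamiliar with the F$_{0.5}$ score referred to above, the following is a minimal sketch of the standard F$_\beta$ formula, which with $\beta = 0.5$ weights precision more heavily than recall. It is not the CLEME implementation: the function name `f_beta`, its parameters, and the convention of returning a precision or recall of 1.0 when the corresponding denominator is zero are assumptions made for illustration, and the counts are taken as given rather than derived from chunk sequences.

```python
def f_beta(tp: float, fp: float, fn: float, beta: float = 0.5) -> float:
    """Standard F_beta from true-positive, false-positive and false-negative counts.

    Hypothetical helper for illustration; how the counts are obtained
    (e.g. from chunk-level matching) is outside the scope of this sketch.
    """
    # Assumed convention: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0
    recall = tp / (tp + fn) if tp + fn > 0 else 1.0

    denominator = beta**2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero.
        return 0.0
    return (1 + beta**2) * precision * recall / denominator


if __name__ == "__main__":
    # 3 correct corrections, 1 spurious correction, 2 missed corrections.
    print(round(f_beta(3, 1, 2), 4))
```

With 3 true positives, 1 false positive, and 2 false negatives, precision is 0.75 and recall is 0.6, so the score lies between the two but closer to precision.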