Automatic evaluation in grammatical error correction (GEC) is crucial for selecting the best-performing systems. Reference-based metrics, which measure the similarity between hypothesis and reference sentences, are currently a popular choice. However, embedding-based similarity measures such as BERTScore are often ineffective, because many words in the source sentence remain unchanged in both the hypothesis and the reference. This study instead focuses on edits designed specifically for GEC, namely ERRANT edits, and computes similarity over the edits applied to the source sentence. To this end, we propose the edit vector, a representation of an edit, and introduce a new metric, UOT-ERRANT, which transports these edit vectors from the hypothesis to the reference using unbalanced optimal transport. Experiments with SEEDA meta-evaluation show that UOT-ERRANT improves evaluation performance, particularly in the +Fluency domain, where many edits occur. Moreover, the method is highly interpretable because the transport plan can be read as a soft edit alignment, making UOT-ERRANT useful both for ranking GEC systems and for analyzing them. Our code is available at https://github.com/gotutiyan/uot-errant.
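
To make the transport step concrete, the following is a minimal sketch of entropic unbalanced optimal transport between two small sets of vectors, using the standard Sinkhorn-style scaling iterations with KL-relaxed marginals. It is not the released UOT-ERRANT implementation. The function name `unbalanced_sinkhorn`, the parameters `eps` and `rho`, the toy two-dimensional "edit vectors", the Euclidean cost, and the unit mass per edit are all assumptions made for illustration.

```python
import numpy as np


def unbalanced_sinkhorn(a, b, cost, eps=0.1, rho=1.0, n_iters=200):
    """Entropic unbalanced OT with KL-relaxed marginals (hypothetical helper).

    a, b  : masses of the source and target points
    cost  : pairwise cost matrix of shape (len(a), len(b))
    eps   : entropic regularization strength (assumed value)
    rho   : marginal relaxation strength (assumed value)
    """
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    power = rho / (rho + eps)        # exponent that softens the marginal constraints
    for _ in range(n_iters):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]   # transport plan


# Toy stand-ins for edit vectors (illustrative values, not real embeddings).
hyp_edits = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])  # three hypothesis edits
ref_edits = np.array([[1.0, 0.0], [0.0, 1.2]])              # two reference edits

# Assumed cost: Euclidean distance between every hypothesis/reference pair.
cost = np.linalg.norm(hyp_edits[:, None, :] - ref_edits[None, :, :], axis=-1)

# Assumed masses: one unit per edit on each side.
a = np.ones(len(hyp_edits))
b = np.ones(len(ref_edits))

plan = unbalanced_sinkhorn(a, b, cost)
print(np.round(plan, 3))
```

In this toy setting, each row of `plan` can be read as a soft alignment of one hypothesis edit to the reference edits. Because the marginals are relaxed rather than enforced, the third hypothesis edit, which is far from every reference edit, is left almost untransported instead of being forced onto a poor match.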