Various evaluation metrics have been proposed for Grammatical Error Correction (GEC), but many, particularly reference-free metrics, lack explainability. This lack of explainability hinders researchers from analyzing the strengths and weaknesses of GEC models and limits the ability to provide detailed feedback to users. To address this issue, we propose attributing sentence-level scores to individual edits, providing insight into how specific corrections contribute to overall performance. As the attribution method, we use Shapley values from cooperative game theory to compute the contribution of each edit. Experiments with existing sentence-level metrics demonstrate high consistency across different edit granularities and show approximately 70\% alignment with human evaluations. In addition, we analyze biases in the metrics based on the attribution results, revealing trends such as a tendency to ignore orthographic edits. Our implementation is available at \url{https://github.com/naist-nlp/gec-attribute}.
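To illustrate the idea of attributing a sentence-level score to individual edits, the following is a minimal sketch of exact Shapley-value computation over a set of edits. The toy metric, edit representation, and example sentence are hypothetical illustrations, not the paper's actual metrics or data: the score of a subset of edits is simply the fraction of tokens matching a reference after applying that subset.

```python
from itertools import combinations
from math import factorial

def shapley_values(n, score):
    """Exact Shapley values for n players, given score(frozenset) -> float.

    phi_i = sum over coalitions S not containing i of
            |S|! (n - |S| - 1)! / n! * (score(S + {i}) - score(S))
    """
    phi = [0.0] * n
    for i in range(n):
        others = [p for p in range(n) if p != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += weight * (score(frozenset(S) | {i}) - score(frozenset(S)))
    return phi

# Hypothetical example: two edits on a toy sentence.
source = "he go to school yesterday .".split()
reference = "He went to school yesterday .".split()
edits = {0: (0, "He"), 1: (1, "went")}  # edit id -> (token position, replacement)

def score(subset):
    # Toy sentence-level "metric": apply the chosen edits, then measure
    # token overlap with the reference (a stand-in for a real GEC metric).
    tokens = list(source)
    for e in subset:
        pos, repl = edits[e]
        tokens[pos] = repl
    return sum(t == r for t, r in zip(tokens, reference)) / len(reference)

phi = shapley_values(len(edits), score)
```

By the efficiency property, the attributions in `phi` sum to the total score gain of applying all edits, so each edit's share of the sentence-level score is accounted for exactly.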