Although rarely stated explicitly, Grammatical Error Correction (GEC) in practice encompasses a variety of models with distinct objectives, ranging from grammatical error detection to fluency improvement. Traditional evaluation methods fail to capture the full range of system capabilities and objectives. Reference-based evaluations are limited in their ability to capture the wide variety of possible corrections, suffer from biases introduced during reference creation, and are prone to favoring local error fixes over overall text improvement. The emergence of large language models (LLMs) has further highlighted the shortcomings of these evaluation strategies, emphasizing the need for a paradigm shift in evaluation methodology. In the current study, we perform a comprehensive evaluation of various GEC systems using a recently published dataset of Swedish learner texts. The evaluation is performed using established evaluation metrics as well as human judges. We find that GPT-3 in a few-shot setting by far outperforms previous grammatical error correction systems for Swedish, a language comprising only 0.11% of its training data. We also find that current evaluation methods contain undesirable biases that a human evaluation is able to reveal. We suggest using human post-editing of GEC system outputs to analyze the amount of change required to reach native-level human performance on the task, and provide a dataset annotated with human post-edits and assessments of grammaticality, fluency and meaning preservation of GEC system outputs.