Large language models (LLMs) have been reported to outperform existing automatic evaluation metrics on some tasks, such as text summarization and machine translation. However, LLMs have rarely been studied as evaluators for grammatical error correction (GEC). In this study, we investigate the performance of LLMs in GEC evaluation using prompts that incorporate various evaluation criteria inspired by previous research. Our extensive experiments show that GPT-4 achieves a Kendall's rank correlation of 0.662 with human judgments, surpassing all existing methods. Furthermore, our results underscore the importance of LLM scale in recent GEC evaluation and, among the evaluation criteria, the particular importance of fluency.
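To make the reported agreement measure concrete, the following is a minimal sketch of how a Kendall rank correlation between human judgments and an evaluator's scores can be computed. It assumes the simple tau-a variant, in which tied pairs count as neither concordant nor discordant; the abstract does not state which variant or aggregation level the study uses. The function name `kendall_tau` and the toy scores are hypothetical and chosen only for illustration; they are not data from the study.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant (an assumption
    of this sketch, not necessarily the variant used in the study).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")

    concordant = 0
    discordant = 0
    # Compare every pair of items: do both score lists order them the same way?
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1

    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy scores for five hypothetical GEC outputs (invented for illustration).
human_scores = [4.0, 2.5, 3.0, 1.0, 5.0]
llm_scores = [4.5, 3.0, 2.0, 1.5, 5.0]

tau = kendall_tau(human_scores, llm_scores)
print(f"Kendall's tau-a: {tau:.3f}")
```

A value of 1 means the evaluator ranks every pair of outputs in the same order as the human judges, and -1 means it reverses every pair. In the toy data, nine of the ten pairs agree and one disagrees, which gives (9 - 1) / 10 = 0.8.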