Thanks to recent advances in generative AI, we are able to prompt large language models (LLMs) to produce text that is fluent and grammatical. In addition, it has been shown that LLMs attempt grammatical error correction (GEC) when they are prompted with ungrammatical input sentences. We evaluate how well LLMs perform at GEC by measuring their performance on established benchmark datasets. We go beyond previous studies, which only examined GPT* models on a selection of English GEC datasets, by evaluating seven open-source and three commercial LLMs on four established GEC benchmarks. We investigate model performance and report results by individual error type. Our results indicate that LLMs do not always outperform supervised English GEC models; they do so only in specific contexts, namely commercial LLMs on benchmarks annotated with fluency corrections as opposed to minimal edits. We find that several open-source models outperform commercial ones on minimal-edit benchmarks, and that in some settings zero-shot prompting is as competitive as few-shot prompting.