GPT-3 and GPT-4 models are powerful, achieving high performance on a variety of Natural Language Processing tasks. However, there is a relative lack of detailed published analysis of their performance on the task of grammatical error correction (GEC). To address this, we perform experiments testing the capabilities of a GPT-3.5 model (text-davinci-003) and a GPT-4 model (gpt-4-0314) on major GEC benchmarks. We compare the performance of different prompts in both zero-shot and few-shot settings, analyzing intriguing or problematic outputs encountered with different prompt formats. We report the performance of our best prompt on the BEA-2019 and JFLEG datasets, finding that the GPT models can perform well in a sentence-level revision setting, with GPT-4 achieving a new high score on the JFLEG benchmark. Through human evaluation experiments, we compare the GPT models' corrections to source, human reference, and baseline GEC system sentences and observe differences in editing strategies and how they are scored by human raters.
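The distinction between the zero-shot and few-shot settings mentioned above can be illustrated with a minimal sketch of prompt construction. The instruction wording, the `Input:`/`Output:` layout, and the function name `build_prompt` are all assumptions made for illustration; they are not the prompts used in the experiments, and no model API is called.

```python
from typing import Sequence, Tuple

# Hypothetical instruction text (an assumption, not the prompt from the experiments).
INSTRUCTION = "Correct the grammatical errors in the following sentence."


def build_prompt(source: str, examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Build a GEC prompt for one sentence.

    With no examples this is a zero-shot prompt; with one or more
    (erroneous, corrected) pairs it becomes a few-shot prompt.
    """
    parts = [INSTRUCTION, ""]
    # Each demonstration pair is shown in the same layout as the final query.
    for erroneous, corrected in examples:
        parts.append(f"Input: {erroneous}")
        parts.append(f"Output: {corrected}")
        parts.append("")
    # The sentence to correct comes last, with the output slot left open.
    parts.append(f"Input: {source}")
    parts.append("Output:")
    return "\n".join(parts)


zero_shot = build_prompt("She go to school.")
few_shot = build_prompt(
    "She go to school.",
    examples=[("He have a car.", "He has a car.")],
)
```

Under this sketch, the only difference between the two settings is whether demonstration pairs precede the sentence to be corrected, so the same sentence-level input can be evaluated under both.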