Grammatical error correction aims to correct ungrammatical sentences automatically. Recent work has demonstrated the excellent capabilities of closed-source Large Language Models (LLMs, e.g., ChatGPT) in grammatical error correction. However, the potential of open-source LLMs remains unexplored. In this paper, we introduce GrammarGPT, an open-source LLM, as a preliminary exploration of that potential for native Chinese grammatical error correction (CGEC). The core recipe of GrammarGPT is to leverage a hybrid dataset of ChatGPT-generated and human-annotated data. For grammatical errors with clues, we propose a heuristic method that guides ChatGPT to generate ungrammatical sentences by providing those clues. For grammatical errors without clues, we collect ungrammatical sentences from publicly available websites and correct them manually. In addition, we employ an error-invariant augmentation method to enhance the model's ability to correct native Chinese grammatical errors. In total, we construct about 1k parallel sentence pairs and use them to fine-tune open-source LLMs (e.g., Phoenix, released by The Chinese University of Hong Kong, Shenzhen) with instruction tuning. The experimental results show that GrammarGPT significantly outperforms the existing SOTA system. Although the model has 20x more parameters than the SOTA baseline, it requires 1200x less data for instruction tuning, illustrating the potential of open-source LLMs on native CGEC. Our GrammarGPT ranks $3^{rd}$ on NLPCC2023 SharedTask1, demonstrating the effectiveness of our approach. The code and data are available at \url{https://github.com/FreedomIntelligence/GrammarGPT}.
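As a rough illustration of the hybrid-data recipe described in the abstract, the sketch below merges ChatGPT-generated and human-annotated sentence pairs into instruction-tuning records. It is a minimal sketch under stated assumptions, not the released GrammarGPT pipeline: the names `ParallelPair`, `INSTRUCTION`, `to_instruction_record`, and `build_hybrid_dataset`, the record fields, the deduplication step, and the instruction wording are all hypothetical, and the English toy sentences merely stand in for native Chinese data.

```python
from dataclasses import dataclass

# Hypothetical instruction text; the actual prompt used for GrammarGPT may differ.
INSTRUCTION = "Correct the grammatical errors in the following sentence."


@dataclass(frozen=True)
class ParallelPair:
    source: str  # ungrammatical sentence
    target: str  # corrected sentence
    origin: str  # "chatgpt" (generated from clues) or "human" (collected and corrected)


def to_instruction_record(pair: ParallelPair) -> dict:
    """Wrap one parallel pair in an (assumed) instruction-tuning record layout."""
    return {
        "instruction": INSTRUCTION,
        "input": pair.source,
        "output": pair.target,
        "origin": pair.origin,
    }


def build_hybrid_dataset(generated, annotated) -> list:
    """Merge both data sources, dropping exact duplicate (source, target) pairs."""
    pairs = [ParallelPair(s, t, "chatgpt") for s, t in generated]
    pairs += [ParallelPair(s, t, "human") for s, t in annotated]

    seen = set()
    records = []
    for pair in pairs:
        key = (pair.source, pair.target)
        if key in seen:
            continue  # keep only the first occurrence of each pair
        seen.add(key)
        records.append(to_instruction_record(pair))
    return records


# Toy English stand-ins for illustration only.
generated_pairs = [
    ("He go to school.", "He goes to school."),
    ("He go to school.", "He goes to school."),  # exact duplicate, dropped
]
annotated_pairs = [("She have a book.", "She has a book.")]

records = build_hybrid_dataset(generated_pairs, annotated_pairs)
```

Keeping an `origin` field on each record is a design choice of this sketch; it makes it easy to inspect or rebalance the two data sources before fine-tuning.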