Grammatical error correction aims to correct ungrammatical sentences automatically. Recent work has demonstrated the excellent capabilities of closed-source Large Language Models (LLMs, e.g., ChatGPT) in grammatical error correction. However, the potential of open-source LLMs remains unexplored. In this paper, we introduce GrammarGPT, an open-source LLM, to preliminarily explore its potential for native Chinese grammatical error correction (CGEC). The core recipe of GrammarGPT is to leverage a hybrid dataset of ChatGPT-generated and human-annotated data. For grammatical errors with clues, we propose a heuristic method that guides ChatGPT to generate ungrammatical sentences by providing those clues. For grammatical errors without clues, we collect ungrammatical sentences from publicly available websites and correct them manually. In addition, we employ an error-invariant augmentation method to enhance the model's ability to correct native Chinese grammatical errors. We ultimately construct about 1k parallel sentence pairs and use them to fine-tune open-source LLMs (e.g., Phoenix, released by The Chinese University of Hong Kong, Shenzhen) with instruction tuning. The experimental results show that GrammarGPT significantly outperforms the existing SOTA system. Although the model has 20x more parameters than the SOTA baseline, the amount of data required for instruction tuning is 1200x smaller, illustrating the potential of open-source LLMs on native CGEC. GrammarGPT ranks $3^{rd}$ on NLPCC2023 SharedTask1, demonstrating the effectiveness of our approach. The code and data are available at \url{https://github.com/FreedomIntelligence/GrammarGPT}.
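To make the data recipe above concrete, the following is a minimal sketch of how a hybrid set of parallel pairs might be expanded and turned into instruction-tuning records. It is not taken from the GrammarGPT repository. The names `ParallelPair`, `swap_entity`, `to_record`, `build_dataset`, the `INSTRUCTION` string, and the record fields are all hypothetical, and the example sentences are toy data written for illustration. The sketch also assumes one plausible reading of "error-invariant augmentation": replacing content the error does not depend on (here, a person's name) on both sides of a pair, so that the error itself stays unchanged.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical instruction text; the abstract does not give the actual wording.
INSTRUCTION = "Correct the grammatical errors in the following sentence."


@dataclass(frozen=True)
class ParallelPair:
    source: str  # ungrammatical sentence
    target: str  # corrected sentence
    origin: str  # "chatgpt" (generated from clues) or "human" (collected and corrected)


def swap_entity(pair: ParallelPair, old: str, new: str) -> Optional[ParallelPair]:
    """Replace an entity on both sides of a pair, leaving the error untouched.

    Returns None when the entity is missing from either side, since a
    one-sided swap would introduce a spurious difference between source
    and target.
    """
    if old not in pair.source or old not in pair.target:
        return None
    return ParallelPair(
        source=pair.source.replace(old, new),
        target=pair.target.replace(old, new),
        origin=pair.origin + "+aug",
    )


def to_record(pair: ParallelPair) -> Dict[str, str]:
    """Format one pair as an instruction-tuning record (assumed schema)."""
    return {
        "instruction": INSTRUCTION,
        "input": pair.source,
        "output": pair.target,
        "origin": pair.origin,
    }


def build_dataset(
    pairs: List[ParallelPair], swaps: List[Tuple[str, str]]
) -> List[Dict[str, str]]:
    """Emit each original pair, followed by any augmented copies of it."""
    records = []
    for pair in pairs:
        records.append(to_record(pair))
        for old, new in swaps:
            augmented = swap_entity(pair, old, new)
            if augmented is not None:
                records.append(to_record(augmented))
    return records


# Toy pairs, written for this sketch only.
pairs = [
    # Redundancy error with a clue: "大约" and "左右" both mean "approximately".
    ParallelPair("小明大约花了三个小时左右完成作业。", "小明大约花了三个小时完成作业。", "chatgpt"),
    # Collocation error without a fixed clue word.
    ParallelPair("他的写作水平明显改进了。", "他的写作水平明显提高了。", "human"),
]
records = build_dataset(pairs, swaps=[("小明", "小红")])
```

Because the swap is applied to both the ungrammatical and the corrected sentence, each augmented pair exercises the same error in a different surface context, which is the property the sketch assumes the augmentation is meant to preserve.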