Most current research on grammatical error correction (GEC) concentrates on widely used languages such as English and Chinese, while many low-resource languages lack accessible evaluation corpora. Efficiently constructing high-quality GEC evaluation corpora for low-resource languages has become a significant challenge. To address this gap, we present a framework for constructing GEC corpora. Specifically, we take Indonesian as our target language and use the proposed framework to construct an evaluation corpus for Indonesian GEC, addressing the limitations of existing Indonesian evaluation corpora. Furthermore, we investigate the feasibility of using existing large language models (LLMs), such as GPT-3.5-Turbo and GPT-4, to streamline corpus annotation in GEC tasks. The results demonstrate significant potential for enhancing the performance of LLMs in low-resource language settings. Our code and corpus are available at https://github.com/GKLMIP/GEC-Construction-Framework.