Natural language processing (NLP) commonly uses text data augmentation to overcome sample size constraints, since increasing the sample size is a natural and widely used strategy for alleviating data scarcity. In this study, we focus on Arabic, which is considered a low-resource language for grammatical error correction (GEC). Furthermore, most Arabic GEC research relies solely on the QALB-14 and QALB-15 datasets, which together contain approximately 20,500 parallel examples, a small number compared with other languages. Therefore, this study aims to develop an Arabic GEC corpus called "Tibyan" using ChatGPT. ChatGPT serves as a data augmentation tool: it is guided by pairs consisting of an Arabic sentence containing grammatical errors and its error-free counterpart extracted from Arabic books, called guide sentences. Building the corpus involved multiple steps. First, we collected and pre-processed pairs of Arabic texts from various sources, such as books and open-access corpora. We then used ChatGPT to generate a parallel corpus, with the previously collected text serving as a guide for producing sentences containing multiple types of errors. Linguistic experts reviewed and validated the automatically generated sentences to ensure that the corrected sides were accurate and error-free, and the corpus was refined iteratively based on their feedback. Finally, we used the Arabic Error Type Annotation tool (ARETA) to analyze the error types in the Tibyan corpus. The corpus contains 49 error types spanning seven classes: orthography, morphology, syntax, semantics, punctuation, merge, and split. In total, the Tibyan corpus contains approximately 600K tokens.
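The guide-sentence augmentation step described above can be sketched as follows. The abstract does not give the exact prompt wording or API calls used to build Tibyan, so the template and the function `build_augmentation_prompt` below are purely illustrative assumptions; in practice the resulting prompt would be sent to ChatGPT through an API client, which is omitted here.

```python
# Hypothetical sketch of the guide-sentence prompting step.
# The prompt template is an assumption, not the actual Tibyan prompt.

GUIDE_TEMPLATE = (
    "You are given a pair of Arabic sentences as a guide: one contains "
    "grammatical errors and the other is its error-free correction.\n"
    "Erroneous: {erroneous}\n"
    "Correct: {correct}\n"
    "Generate {n} new Arabic sentence pairs with similar error types, "
    "each consisting of an erroneous sentence and its correction."
)

def build_augmentation_prompt(erroneous: str, correct: str, n: int = 5) -> str:
    """Fill the guide-sentence template for one parallel example."""
    return GUIDE_TEMPLATE.format(erroneous=erroneous, correct=correct, n=n)

# Example with a placeholder guide pair (common orthographic errors:
# missing hamza on "إلى" and taa marbuta written as haa in "المدرسة").
prompt = build_augmentation_prompt("ذهب الولد الى المدرسه",
                                   "ذهب الولد إلى المدرسة", n=3)
print(prompt)
```

Each ChatGPT response would then be parsed back into erroneous/correct sentence pairs and passed to the expert-review stage before entering the corpus.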