Grammatical Error Correction has seen significant progress with recent advances in deep learning. Because these methods require huge amounts of data, synthetic datasets are being built to fill the gap. Unfortunately, synthetic datasets are not organic enough in some cases, and they even require clean data to start with. Furthermore, most existing work focuses on English. In this work, we introduce a new organic data-driven approach, clean insertions, to build parallel Turkish Grammatical Error Correction datasets from any organic data and to clean the data used for training Large Language Models. We achieve state-of-the-art results on two of the three publicly available Turkish Grammatical Error Correction test sets. We also show the effectiveness of our method through its effect on the training losses of language models.