Grammatical error correction (GEC) is a well-explored problem in English, with many existing models and datasets. Research on GEC in morphologically rich languages, however, has been limited by challenges such as data scarcity and language complexity. In this paper, we present the first results on Arabic GEC using two newly developed Transformer-based pretrained sequence-to-sequence models. We also define the task of multi-class Arabic grammatical error detection (GED) and present the first results on it. We show that using GED information as an auxiliary input to GEC models improves GEC performance across three datasets spanning different genres. In addition, we investigate the use of contextual morphological preprocessing to aid GEC systems. Our models achieve state-of-the-art results on two Arabic GEC shared task datasets and establish a strong benchmark on a recently created dataset. We make our code, data, and pretrained models publicly available.