Research on Korean grammatical error correction (GEC) is limited compared to that on other major languages such as English. We attribute this gap to the lack of a carefully designed evaluation benchmark for Korean GEC. In this work, we collect three datasets from different sources (Kor-Lang8, Kor-Native, and Kor-Learner) that cover a wide range of Korean grammatical errors. Considering the nature of Korean grammar, we then define 14 error types for Korean and provide KAGAS (Korean Automatic Grammatical error Annotation System), which can automatically annotate error types from parallel corpora. We use KAGAS on our datasets to build an evaluation benchmark for Korean, and present baseline models trained on our datasets. We show that the model trained with our datasets significantly outperforms the currently used statistical Korean GEC system (Hanspell) on a wider range of error types, demonstrating the diversity and usefulness of the datasets. The implementations and datasets are open-sourced.