Grammatical error correction is one of the fundamental tasks in Natural Language Processing. For the Russian language, most available spellcheckers correct typos and other simple errors with high accuracy, but they often fail on non-native (L2) writing, which contains errors that are not typical of native speakers. In this paper, we propose a pipeline involving a language model intended for correcting errors in L2 Russian writing. The proposed language model is trained on untagged texts from the Newspaper subcorpus of the Russian National Corpus, and the quality of the model is validated against the RULEC-GEC corpus.
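To illustrate the general idea of using a language model trained on untagged text to choose among candidate corrections, here is a minimal sketch. It is not the pipeline proposed in the paper: the abstract does not specify the model architecture, so an add-one-smoothed bigram model is assumed purely for illustration. The toy corpus, the candidate sentences, and the function names `train_bigram_lm`, `log_prob`, and `best_candidate` are all hypothetical.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count unigrams and bigrams over whitespace-tokenized, untagged sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Return the add-one-smoothed bigram log-probability of a sentence."""
    vocab_size = len(unigrams) + 1  # one extra slot for unseen words
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


def best_candidate(candidates, unigrams, bigrams):
    """Pick the candidate sentence that the language model scores highest."""
    return max(candidates, key=lambda c: log_prob(c, unigrams, bigrams))


# Hypothetical toy stand-in for a large untagged training corpus.
toy_corpus = [
    "я читаю книгу",
    "она читает книгу",
    "я вижу книгу",
]
unigrams, bigrams = train_bigram_lm(toy_corpus)

# A case error of the kind an L2 writer might make, and its correction.
candidates = ["я читаю книга", "я читаю книгу"]
corrected = best_candidate(candidates, unigrams, bigrams)
print(corrected)
```

In this sketch the candidates have the same number of tokens, so their log-probabilities are directly comparable; a realistic system would also need a candidate generator and some form of length normalization.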