Code-switching (CSW) is a common phenomenon among multilingual speakers in which multiple languages are used within a single discourse or utterance. Mixed-language utterances can still contain grammatical errors, yet most existing Grammar Error Correction (GEC) systems are trained on monolingual data and were not developed with CSW in mind. In this work, we conduct the first exploration of GEC systems on CSW text. As part of this exploration, we propose a novel method of generating synthetic CSW GEC datasets by translating different spans of text within existing GEC corpora. We then investigate different methods of selecting these spans, based on CSW ratio, switch-point factor and linguistic constraints, and identify how they affect the performance of GEC systems on CSW text. Our best model achieves an average increase of 1.57 $F_{0.5}$ across three CSW test sets (English-Chinese, English-Korean and English-Japanese) without affecting the model's performance on a monolingual dataset. We also find that models trained on one CSW language generalise relatively well to other, typologically similar CSW languages.
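The span-translation idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy lexicon standing in for a machine translation system, and the definition of CSW ratio as the fraction of tokens in the embedded language are all assumptions made for illustration. The sketch translates the same span in both the erroneous source and the corrected target of a GEC sentence pair, so that the grammatical error is preserved in the code-switched pair.

```python
from typing import Callable, List, Optional, Tuple

Tagged = Tuple[List[str], List[bool]]  # tokens plus a per-token "switched" flag


def find_span(tokens: List[str], span: List[str]) -> int:
    """Return the start index of the first occurrence of span, or -1."""
    n = len(span)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == span:
            return i
    return -1


def switch_span(tokens: List[str], start: int, end: int,
                translate: Callable[[str], str]) -> Tagged:
    """Replace tokens[start:end] with their translation and flag the new tokens."""
    translated = translate(" ".join(tokens[start:end])).split()
    new_tokens = tokens[:start] + translated + tokens[end:]
    flags = ([False] * start + [True] * len(translated)
             + [False] * (len(tokens) - end))
    return new_tokens, flags


def make_csw_pair(source: List[str], target: List[str], span: List[str],
                  translate: Callable[[str], str]) -> Optional[Tuple[Tagged, Tagged]]:
    """Translate the same span on both sides of a GEC pair.

    Returns None when the span is missing from either side, so the error
    itself is never translated away.
    """
    out = []
    for sent in (source, target):
        i = find_span(sent, span)
        if i < 0:
            return None
        out.append(switch_span(sent, i, i + len(span), translate))
    return out[0], out[1]


def csw_ratio(flags: List[bool]) -> float:
    """Fraction of tokens in the embedded language (assumed definition)."""
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical stand-in for a real machine translation system.
TOY_LEXICON = {"the library": "图书馆"}


def toy_translate(text: str) -> str:
    return TOY_LEXICON.get(text, text)


source = "I goes to the library yesterday".split()   # erroneous sentence
target = "I went to the library yesterday".split()   # corrected sentence
pair = make_csw_pair(source, target, ["the", "library"], toy_translate)
```

Here the span is passed in explicitly. A selection strategy based on CSW ratio, switch-point factor or linguistic constraints would decide which span to pass, and a filter on `csw_ratio` could reject pairs outside a desired range.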