Code-switching (CSW) is a common phenomenon among multilingual speakers in which multiple languages are used within a single discourse or utterance. Mixed-language utterances can still contain grammatical errors, yet most existing Grammar Error Correction (GEC) systems are trained on monolingual data and were not developed with CSW in mind. In this work, we conduct the first exploration of GEC systems on CSW text. We propose a novel method for generating synthetic CSW GEC datasets by translating different spans of text within existing GEC corpora. We then investigate different methods of selecting these spans, based on CSW ratio, switch-point factor and linguistic constraints, and identify how they affect the performance of GEC systems on CSW text. Our best model achieves an average increase of 1.57 $F_{0.5}$ across three CSW test sets (English-Chinese, English-Korean and English-Japanese) without affecting the model's performance on a monolingual dataset. We also find that models trained on one CSW language generalise relatively well to other typologically similar CSW languages.
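The span-translation idea can be illustrated with a minimal sketch. It assumes a single contiguous span whose length is set by a target CSW ratio, and it uses a toy word-for-word lexicon in place of a real translation system. The names `pick_span`, `make_csw`, `toy_translate` and `TOY_LEXICON` are hypothetical and do not come from the paper, and the sketch does not cover the switch-point or linguistic-constraint selection methods, nor how errors are carried over between the paired sentences of a GEC corpus.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical toy lexicon standing in for a real translation system.
TOY_LEXICON: Dict[str, str] = {"school": "学校", "tomorrow": "明天", "go": "去"}


def toy_translate(span: List[str]) -> List[str]:
    """Translate token by token; tokens missing from the lexicon are kept."""
    return [TOY_LEXICON.get(tok, tok) for tok in span]


def pick_span(n_tokens: int, csw_ratio: float, rng: random.Random) -> Tuple[int, int]:
    """Pick a contiguous [start, end) span covering about csw_ratio of the tokens."""
    if n_tokens == 0:
        return (0, 0)
    # Clamp the span length to between 1 token and the whole sentence.
    length = max(1, min(n_tokens, round(n_tokens * csw_ratio)))
    start = rng.randint(0, n_tokens - length)
    return (start, start + length)


def make_csw(
    tokens: List[str],
    csw_ratio: float,
    translate: Callable[[List[str]], List[str]],
    rng: random.Random,
) -> List[str]:
    """Replace one randomly chosen span with its translation."""
    start, end = pick_span(len(tokens), csw_ratio, rng)
    return tokens[:start] + translate(tokens[start:end]) + tokens[end:]


if __name__ == "__main__":
    sentence = "I will go to school tomorrow".split()
    print(" ".join(make_csw(sentence, 0.5, toy_translate, random.Random(0))))
```

Passing the translator as a callable keeps span selection separate from translation, so a different selection strategy could be swapped in without touching the rest of the sketch.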
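For reference, $F_{0.5}$ is the standard $F_\beta$ measure with $\beta = 0.5$, which weights precision more heavily than recall. The symbols $P$ (precision) and $R$ (recall) are introduced here for illustration; the abstract does not state how edits are matched or counted.

$$F_{\beta} = (1 + \beta^{2})\,\frac{P \cdot R}{\beta^{2} P + R}, \qquad F_{0.5} = \frac{1.25\, P \cdot R}{0.25\, P + R}$$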