With the rise of globalisation, code-switching (CSW) has become a ubiquitous part of multilingual conversation, posing new challenges for natural language processing (NLP), especially in Grammatical Error Correction (GEC). This work explores the complexities of applying GEC systems to CSW texts. Our objectives are to evaluate the performance of state-of-the-art GEC systems on an authentic CSW dataset from English as a Second Language (ESL) learners, to explore synthetic data generation as a solution to data scarcity, and to develop a model capable of correcting grammatical errors in both monolingual and CSW texts. We generated synthetic CSW GEC data, producing one of the first substantial datasets for this task, and showed that a model trained on this data improves significantly over existing systems. This work targets ESL learners, aiming to provide educational technologies that help them improve their grammatical accuracy in English without constraining their natural multilingualism.