Many multilingual communities, including many in Africa, frequently code-switch in conversation. This behaviour highlights the need for natural language processing technologies that can handle code-switched text. However, data scarcity poses a significant challenge, particularly for African languages, many of which are low-resourced and under-represented. In this study, we prompted GPT-3.5 to generate Afrikaans--English and Yoruba--English code-switched sentences, enhancing diversity using topic-keyword pairs, linguistic guidelines, and few-shot examples. Our findings indicate that the quality of generated sentences for languages using non-Latin scripts, like Yoruba, is considerably lower than the high success rate for Afrikaans--English. There is therefore a notable opportunity to refine prompting guidelines to yield sentences suitable for fine-tuning language models. We propose a framework for augmenting the diversity of synthetically generated code-switched data using GPT, and we propose leveraging this technology to mitigate data scarcity in low-resourced languages, underscoring the essential role of native speakers in this process.
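As a rough illustration of how topic-keyword pairs, linguistic guidelines, and few-shot examples might be combined into a single prompt, consider the minimal sketch below. It is not the prompt used in the study: the topic table, the guideline wording, the placeholder example slots, and the helper names `build_prompt` and `sample_prompts` are all assumptions made for illustration, and no model is actually called.

```python
import random

# Hypothetical topic -> keywords table; the study's real pairs are not given here.
TOPIC_KEYWORDS = {
    "sport": ["match", "team"],
    "food": ["recipe", "market"],
    "education": ["exam", "teacher"],
}

# Placeholder guidelines (assumed wording, not the authors' actual text).
GUIDELINES = [
    "Switch languages within a sentence, not only between sentences.",
    "Keep each sentence grammatical in both languages.",
]

# Few-shot slots, to be filled with sentences checked by native speakers.
FEW_SHOT = ["<native-speaker example 1>", "<native-speaker example 2>"]


def build_prompt(language, topic, keyword, n_sentences=5):
    """Assemble one prompt string from a topic-keyword pair, guidelines and examples."""
    lines = [
        f"Generate {n_sentences} {language}-English code-switched sentences.",
        f"Topic: {topic}",
        f"Keyword to include: {keyword}",
        "Guidelines:",
    ]
    lines += [f"- {g}" for g in GUIDELINES]
    lines.append("Examples:")
    lines += [f"- {e}" for e in FEW_SHOT]
    return "\n".join(lines)


def sample_prompts(language, k, seed=0):
    """Draw k distinct (topic, keyword) pairs at random and build one prompt for each."""
    rng = random.Random(seed)
    pairs = [(t, kw) for t, kws in TOPIC_KEYWORDS.items() for kw in kws]
    chosen = rng.sample(pairs, k)  # sampling without replacement keeps pairs distinct
    return [build_prompt(language, t, kw) for t, kw in chosen]


if __name__ == "__main__":
    for prompt in sample_prompts("Afrikaans", k=2):
        print(prompt)
        print("---")
```

Varying the topic-keyword pair per prompt is one simple way to push generated sentences toward different subject matter; the few-shot slots are left as placeholders because, as noted above, such examples should come from native speakers.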