While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality, low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: can these systems be used for data generation? In this article, we explore prompting LLMs in a zero-shot manner to create code-mixed data for five languages in South East Asia (SEA): Indonesian, Malay, Chinese, Tagalog, and Vietnamese, as well as the creole language Singlish. We find that ChatGPT shows the most potential, producing code-mixed text 68% of the time when the term "code-mixing" is explicitly defined. Moreover, both ChatGPT and InstructGPT (davinci-003) perform notably well in generating Singlish text, averaging a 96% success rate across a variety of prompts. The code-mixing proficiency of ChatGPT and InstructGPT, however, is dampened by word choice errors that lead to semantic inaccuracies. Other multilingual models such as BLOOMZ and Flan-T5-XXL are unable to produce code-mixed text at all. By highlighting the limited promise of LLMs in a specific form of low-resource data generation, we call for a measured approach when applying similar techniques to other data-scarce NLP contexts.