While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality, low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems at generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero-shot manner to generate code-mixed data for seven languages spoken in South East Asia (SEA), namely Indonesian, Malay, Chinese, Tagalog, Vietnamese, Tamil, and Singlish. We find that publicly available multilingual instruction-tuned models such as BLOOMZ and Flan-T5-XXL are incapable of producing texts with phrases or clauses from different languages. ChatGPT exhibits inconsistent capabilities in generating code-mixed texts: its performance varies depending on the prompt template and language pairing. For instance, ChatGPT generates fluent and natural texts in Singlish (an English-based creole spoken in Singapore), but for the English-Tamil language pair, the system mostly produces grammatically incorrect or semantically meaningless utterances. Furthermore, it may erroneously introduce languages not specified in the prompt. Based on our investigation, existing multilingual LLMs exhibit a wide range of proficiency in code-mixed data generation for SEA languages. As such, we advise against using LLMs in this context without extensive human checks.