Code-switching (CS) remains a critical challenge in Natural Language Processing (NLP). Current Large Language Models (LLMs) struggle to interpret and generate code-switched text, primarily due to the scarcity of large-scale CS datasets for training. This paper presents a novel methodology for generating CS data with LLMs, and tests it on the English-Spanish language pair. We propose back-translating natural CS sentences into monolingual English, and using the resulting parallel corpus to fine-tune LLMs to convert monolingual sentences into CS text. Unlike previous approaches to CS generation, our methodology uses natural CS data as a starting point, allowing models to learn its natural distribution beyond grammatical patterns. We thoroughly analyse the models' performance through a study of human preferences, a qualitative error analysis, and an evaluation with popular automatic metrics. Results show that our methodology generates fluent code-switched text, expanding research opportunities in CS communication, and that traditional metrics do not correlate with human judgement when assessing the quality of the generated CS data. We release our code and generated dataset under a CC-BY-NC-SA license.