Code-switching (CS) remains a critical challenge in Natural Language Processing (NLP). Current Large Language Models (LLMs) struggle to interpret and generate code-switched text, primarily due to the scarcity of large-scale CS datasets for training. This paper presents a novel methodology for generating CS data with LLMs, and tests it on the English-Spanish language pair. We propose back-translating natural CS sentences into monolingual English, and using the resulting parallel corpus to fine-tune LLMs to convert monolingual sentences into CS text. Unlike previous approaches to CS generation, our methodology uses natural CS data as a starting point, allowing models to learn its natural distribution beyond grammatical patterns. We thoroughly analyse the models' performance through a study of human preferences, a qualitative error analysis, and an evaluation with popular automatic metrics. Results show that our methodology generates fluent code-switched text, expanding research opportunities in CS communication, and that traditional metrics do not correlate with human judgement when assessing the quality of the generated CS data. We release our code and generated dataset under a CC-BY-NC-SA license.