Prompting Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages - 专知论文

会员服务 ·

0

混合 · 混合数据 · 语言模型化 · ChatGPT · CASE ·

2023 年 3 月 23 日

Prompting Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages

翻译：提示大型语言模型生成代码混合文本：东南亚语言案例

Zheng-Xin Yong,Ruochen Zhang,Jessica Zosa Forde,Skyler Wang,Samuel Cahyawijaya,Holy Lovenia,Lintang Sutawika,Jan Christian Blaise Cruz,Long Phan,Yin Lin Tan,Alham Fikri Aji

While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The proliferation of Large Language Models (LLMs) in recent times compels one to ask: can these systems be used for data generation? In this article, we explore prompting LLMs in a zero-shot manner to create code-mixed data for five languages in South East Asia (SEA) -- Indonesian, Malay, Chinese, Tagalog, Vietnamese, as well as the creole language Singlish. We find that ChatGPT shows the most potential, capable of producing code-mixed text 68% of the time when the term "code-mixing" is explicitly defined. Moreover, both ChatGPT and InstructGPT's (davinci-003) performances in generating Singlish texts are noteworthy, averaging a 96% success rate across a variety of prompts. The code-mixing proficiency of ChatGPT and InstructGPT, however, is dampened by word choice errors that lead to semantic inaccuracies. Other multilingual models such as BLOOMZ and Flan-T5-XXL are unable to produce code-mixed texts altogether. By highlighting the limited promises of LLMs in a specific form of low-resource data generation, we call for a measured approach when applying similar techniques to other data-scarce NLP contexts.

翻译：尽管代码混合是世界许多地区常见的语言实践，但收集高质量且低成本的代码混合数据仍是自然语言处理（NLP）研究面临的挑战。近年来大型语言模型（LLMs）的普及引发了一个问题：这些系统能否用于数据生成？本文探索以零样本方式提示LLMs，为东南亚（SEA）五种语言——印尼语、马来语、中文、他加禄语、越南语以及克里奥尔语新式英语生成代码混合数据。研究发现，当明确界定"代码混合"术语时，ChatGPT展现出最大潜力，能在68%的情况下生成代码混合文本。此外，ChatGPT和InstructGPT（davinci-003）在生成新式英语文本方面的表现尤为突出，在多种提示条件下平均成功率高达96%。然而，ChatGPT和InstructGPT的代码混合能力因用词错误导致的语义不准确而受到削弱。其他多语言模型如BLOOMZ和Flan-T5-XXL则完全无法生成代码混合文本。通过揭示LLMs在特定低资源数据生成形式中的有限前景，我们呼吁在将类似技术应用于其他数据稀缺的NLP场景时采取审慎态度。

0

相关内容

NeurlPS 2022 | 自然语言处理相关论文分类整理

NeurlPS 2022 | 自然语言处理相关论文分类整理

专知会员服务

51+阅读 · 2022年10月2日

【2022新书】Transformer自然语言处理，Natural Language Processing with Transformers: Building Language Applications with Hugging Face

【2022新书】Transformer自然语言处理，Natural Language Processing with Transformers: Building Language Applications with Hugging Face

专知会员服务

524+阅读 · 2022年1月31日

NLP必读经典文献100篇

专知会员服务

124+阅读 · 2020年9月8日

2020数据工程师成长路线图

专知会员服务

41+阅读 · 2020年9月6日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

Python计算导论，560页pdf，Introduction to Computing Using Python

Python计算导论，560页pdf，Introduction to Computing Using Python

专知会员服务

77+阅读 · 2020年5月5日

【微软】大型神经语言模型的对抗性训练，Adversarial Training for Large Neural Language Models

【微软】大型神经语言模型的对抗性训练，Adversarial Training for Large Neural Language Models

专知会员服务

51+阅读 · 2020年5月3日

【BERT改进ElasticSearch的相关准确率80%以上】NBoost: Boost Elasticsearch Search Relevance by 80% with BERT

【BERT改进ElasticSearch的相关准确率80%以上】NBoost: Boost Elasticsearch Search Relevance by 80% with BERT

专知会员服务

13+阅读 · 2019年11月26日

【AAAI2020论文】关注实体以更好地理解文本（Attending to Entities for Better Text Understanding）

【AAAI2020论文】关注实体以更好地理解文本（Attending to Entities for Better Text Understanding）

专知会员服务

25+阅读 · 2019年11月15日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

AINLP

30+阅读 · 2019年9月8日

BERT/Transformer/迁移学习NLP资源大列表

BERT/Transformer/迁移学习NLP资源大列表

专知

19+阅读 · 2019年6月9日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

NLP 2018 Highlights：2018自然语言处理技术亮点汇总

NLP 2018 Highlights：2018自然语言处理技术亮点汇总

AINLP

10+阅读 · 2019年2月9日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【论文推荐】最新八篇情感分析相关论文—Pair-wise判别器、多模态情感分析、上下文语境、Gated 卷积网络

【论文推荐】最新八篇情感分析相关论文—Pair-wise判别器、多模态情感分析、上下文语境、Gated 卷积网络

专知

20+阅读 · 2018年6月29日

【论文推荐】最新五篇命名实体识别相关论文—深度主动学习、Lattice LSTM、混合马尔可夫CRF

【论文推荐】最新五篇命名实体识别相关论文—深度主动学习、Lattice LSTM、混合马尔可夫CRF

专知

26+阅读 · 2018年5月22日

【论文推荐】最新六篇视觉问答（VQA）相关论文—盲人问题、物体计数、多模态解释、视觉关系、对抗性网络、对偶循环注意力

【论文推荐】最新六篇视觉问答（VQA）相关论文—盲人问题、物体计数、多模态解释、视觉关系、对抗性网络、对偶循环注意力

专知

32+阅读 · 2018年2月28日

【论文推荐】最新五篇命名实体识别（NER）相关论文—对抗学习、语料库、深度多任务学习、先验知识、跨语言语义

【论文推荐】最新五篇命名实体识别（NER）相关论文—对抗学习、语料库、深度多任务学习、先验知识、跨语言语义

专知

37+阅读 · 2018年2月21日

【论文推荐】最新5篇图像分割（Image Segmentation）相关论文—多重假设、超像素分割、自监督、图、生成对抗网络

【论文推荐】最新5篇图像分割（Image Segmentation）相关论文—多重假设、超像素分割、自监督、图、生成对抗网络

专知

27+阅读 · 2018年2月7日

Zakharov系统的解的动力学行为研究

国家自然科学基金

0+阅读 · 2015年12月31日

小系统中的反常扩散和Kramers逃逸问题研究

国家自然科学基金

0+阅读 · 2015年12月31日

基于LAMOST数据Mg超丰恒星的搜寻及研究

国家自然科学基金

0+阅读 · 2015年12月31日

利用贝叶斯方法估计LAMOST恒星参数

国家自然科学基金

2+阅读 · 2015年12月31日

混凝土Weibull统计尺寸效应理论模型改进研究

国家自然科学基金

0+阅读 · 2013年12月31日

准晶非线性断裂力学中Dugdale模型的复变函数解法

国家自然科学基金

0+阅读 · 2013年12月31日

托卡马克边缘局域模输运和能流沉积的动力学模拟研究

国家自然科学基金

0+阅读 · 2013年12月31日

Schrodinger-Poisson方程的若干问题研究

国家自然科学基金

1+阅读 · 2012年12月31日

几个非线性Schrodinger方程组模型及相关问题研究

国家自然科学基金

0+阅读 · 2012年12月31日

大质量恒星形成环境的红外亚毫米波观测新研究

国家自然科学基金

0+阅读 · 2011年12月31日

StructGPT: A General Framework for Large Language Model to Reason over Structured Data

StructGPT: A General Framework for Large Language Model to Reason over Structured Data

Arxiv

0+阅读 · 2023年5月16日

GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information

Arxiv

0+阅读 · 2023年5月16日

Large Language Models are Zero-Shot Rankers for Recommender Systems

Arxiv

0+阅读 · 2023年5月15日

Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages

Arxiv

0+阅读 · 2023年5月15日

Parallel Context Windows for Large Language Models

Arxiv

1+阅读 · 2023年5月15日

Using LLM-assisted Annotation for Corpus Linguistics: A Case Study of Local Grammar Analysis

Arxiv

0+阅读 · 2023年5月15日

Evaluating Open-Domain Question Answering in the Era of Large Language Models

Arxiv

0+阅读 · 2023年5月14日

Knowledge Refinement via Interaction Between Search Engines and Large Language Models

Arxiv

0+阅读 · 2023年5月12日

SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models

Arxiv

0+阅读 · 2023年5月12日

Davinci the Dualist: the mind-body divide in large language models and in human learners

Arxiv

0+阅读 · 2023年5月10日

VIP会员

文章信息

相关主题

语言模型化

最新内容

博士论文 | 面向大模型推理的内存高效算法

博士论文 | 面向大模型推理的内存高效算法

专知会员服务

2+阅读 · 7月27日

论文解读 | 从预训练到后训练：理解大模型推理能力如何形成

论文解读 | 从预训练到后训练：理解大模型推理能力如何形成

专知会员服务

3+阅读 · 7月27日

《无人系统互操作性导论——无人系统联合架构（JAUS）》

《无人系统互操作性导论——无人系统联合架构（JAUS）》

专知会员服务

9+阅读 · 7月27日

美空军新型反无人机部队初探

美空军新型反无人机部队初探

专知会员服务

5+阅读 · 7月27日

《对抗性电磁环境下远程巡飞弹作战的安全指挥与控制数据链》

《对抗性电磁环境下远程巡飞弹作战的安全指挥与控制数据链》

专知会员服务

4+阅读 · 7月27日

《北约下一代建模与仿真（NexGen M&S）计划》2026年69页

《北约下一代建模与仿真（NexGen M&S）计划》2026年69页

专知会员服务

3+阅读 · 7月27日

《防空交战流程的概率建模研究》

《防空交战流程的概率建模研究》

专知会员服务

7+阅读 · 7月27日

ICML 2026 教程 | 数值优化理论还重要吗？

ICML 2026 教程 | 数值优化理论还重要吗？

专知会员服务

6+阅读 · 7月26日

ICM 2026 | 陶哲轩：人工智能时代的数学

ICM 2026 | 陶哲轩：人工智能时代的数学

专知会员服务

9+阅读 · 7月26日

《面向可扩展高韧性无人机集群网络的速度感知分层通信框架》

《面向可扩展高韧性无人机集群网络的速度感知分层通信框架》

专知会员服务

8+阅读 · 7月26日

《面向概率推理的可定制战术引擎及其在军事任务规划中的应用》

《面向概率推理的可定制战术引擎及其在军事任务规划中的应用》

专知会员服务

11+阅读 · 7月26日

《先进防空系统选型战略框架：基于巴基斯坦的实证启示》

《先进防空系统选型战略框架：基于巴基斯坦的实证启示》

专知会员服务

8+阅读 · 7月26日

《反无人机交战场景下的战斗归零研究》

《反无人机交战场景下的战斗归零研究》

专知会员服务

7+阅读 · 7月26日

霍尔木兹与不对称作战时代：水雷、无人系统与海军力量的重新定义

霍尔木兹与不对称作战时代：水雷、无人系统与海军力量的重新定义

专知会员服务

4+阅读 · 7月26日

博士论文 | 用代码结构感知方法推进代码大模型

博士论文 | 用代码结构感知方法推进代码大模型

专知会员服务

6+阅读 · 7月25日

相关VIP内容

NeurlPS 2022 | 自然语言处理相关论文分类整理

NeurlPS 2022 | 自然语言处理相关论文分类整理

专知会员服务

51+阅读 · 2022年10月2日

【2022新书】Transformer自然语言处理，Natural Language Processing with Transformers: Building Language Applications with Hugging Face

【2022新书】Transformer自然语言处理，Natural Language Processing with Transformers: Building Language Applications with Hugging Face

专知会员服务

524+阅读 · 2022年1月31日

NLP必读经典文献100篇

专知会员服务

124+阅读 · 2020年9月8日

2020数据工程师成长路线图

专知会员服务

41+阅读 · 2020年9月6日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

Python计算导论，560页pdf，Introduction to Computing Using Python

Python计算导论，560页pdf，Introduction to Computing Using Python

专知会员服务

77+阅读 · 2020年5月5日

【微软】大型神经语言模型的对抗性训练，Adversarial Training for Large Neural Language Models

【微软】大型神经语言模型的对抗性训练，Adversarial Training for Large Neural Language Models

专知会员服务

51+阅读 · 2020年5月3日

【BERT改进ElasticSearch的相关准确率80%以上】NBoost: Boost Elasticsearch Search Relevance by 80% with BERT

【BERT改进ElasticSearch的相关准确率80%以上】NBoost: Boost Elasticsearch Search Relevance by 80% with BERT

专知会员服务

13+阅读 · 2019年11月26日

【AAAI2020论文】关注实体以更好地理解文本（Attending to Entities for Better Text Understanding）

【AAAI2020论文】关注实体以更好地理解文本（Attending to Entities for Better Text Understanding）

专知会员服务

25+阅读 · 2019年11月15日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

热门VIP内容

开通专知VIP会员享更多权益服务

论文解读 | 从预训练到后训练：理解大模型推理能力如何形成

美空军新型反无人机部队初探

博士论文 | 面向大模型推理的内存高效算法

《无人系统互操作性导论——无人系统联合架构（JAUS）》

相关资讯

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

AINLP

30+阅读 · 2019年9月8日

BERT/Transformer/迁移学习NLP资源大列表

BERT/Transformer/迁移学习NLP资源大列表

专知

19+阅读 · 2019年6月9日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

NLP 2018 Highlights：2018自然语言处理技术亮点汇总

NLP 2018 Highlights：2018自然语言处理技术亮点汇总

AINLP

10+阅读 · 2019年2月9日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【论文推荐】最新八篇情感分析相关论文—Pair-wise判别器、多模态情感分析、上下文语境、Gated 卷积网络

【论文推荐】最新八篇情感分析相关论文—Pair-wise判别器、多模态情感分析、上下文语境、Gated 卷积网络

专知

20+阅读 · 2018年6月29日

【论文推荐】最新五篇命名实体识别相关论文—深度主动学习、Lattice LSTM、混合马尔可夫CRF

【论文推荐】最新五篇命名实体识别相关论文—深度主动学习、Lattice LSTM、混合马尔可夫CRF

专知

26+阅读 · 2018年5月22日

【论文推荐】最新六篇视觉问答（VQA）相关论文—盲人问题、物体计数、多模态解释、视觉关系、对抗性网络、对偶循环注意力

【论文推荐】最新六篇视觉问答（VQA）相关论文—盲人问题、物体计数、多模态解释、视觉关系、对抗性网络、对偶循环注意力

专知

32+阅读 · 2018年2月28日

【论文推荐】最新五篇命名实体识别（NER）相关论文—对抗学习、语料库、深度多任务学习、先验知识、跨语言语义

【论文推荐】最新五篇命名实体识别（NER）相关论文—对抗学习、语料库、深度多任务学习、先验知识、跨语言语义

专知

37+阅读 · 2018年2月21日

【论文推荐】最新5篇图像分割（Image Segmentation）相关论文—多重假设、超像素分割、自监督、图、生成对抗网络

【论文推荐】最新5篇图像分割（Image Segmentation）相关论文—多重假设、超像素分割、自监督、图、生成对抗网络

专知

27+阅读 · 2018年2月7日

相关论文

StructGPT: A General Framework for Large Language Model to Reason over Structured Data

StructGPT: A General Framework for Large Language Model to Reason over Structured Data

Arxiv

0+阅读 · 2023年5月16日

GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information

Arxiv

0+阅读 · 2023年5月16日

Large Language Models are Zero-Shot Rankers for Recommender Systems

Arxiv

0+阅读 · 2023年5月15日

Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages

Arxiv

0+阅读 · 2023年5月15日

Parallel Context Windows for Large Language Models

Arxiv

1+阅读 · 2023年5月15日

Using LLM-assisted Annotation for Corpus Linguistics: A Case Study of Local Grammar Analysis

Arxiv

0+阅读 · 2023年5月15日

Evaluating Open-Domain Question Answering in the Era of Large Language Models

Arxiv

0+阅读 · 2023年5月14日

Knowledge Refinement via Interaction Between Search Engines and Large Language Models

Arxiv

0+阅读 · 2023年5月12日

SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models

Arxiv

0+阅读 · 2023年5月12日

Davinci the Dualist: the mind-body divide in large language models and in human learners

Arxiv

0+阅读 · 2023年5月10日

相关基金

Zakharov系统的解的动力学行为研究

国家自然科学基金

0+阅读 · 2015年12月31日

小系统中的反常扩散和Kramers逃逸问题研究

国家自然科学基金

0+阅读 · 2015年12月31日

基于LAMOST数据Mg超丰恒星的搜寻及研究

国家自然科学基金

0+阅读 · 2015年12月31日

利用贝叶斯方法估计LAMOST恒星参数

国家自然科学基金

2+阅读 · 2015年12月31日

混凝土Weibull统计尺寸效应理论模型改进研究

国家自然科学基金

0+阅读 · 2013年12月31日

准晶非线性断裂力学中Dugdale模型的复变函数解法

国家自然科学基金

0+阅读 · 2013年12月31日

托卡马克边缘局域模输运和能流沉积的动力学模拟研究

国家自然科学基金

0+阅读 · 2013年12月31日

Schrodinger-Poisson方程的若干问题研究

国家自然科学基金

1+阅读 · 2012年12月31日

几个非线性Schrodinger方程组模型及相关问题研究

国家自然科学基金

0+阅读 · 2012年12月31日

大质量恒星形成环境的红外亚毫米波观测新研究

国家自然科学基金

0+阅读 · 2011年12月31日

微信扫码咨询专知VIP会员