ChatAug: Leveraging ChatGPT for Text Data Augmentation

Haixing Dai, Zhengliang Liu, Wenxiong Liao, Xiaoke Huang, Zihao Wu, Lin Zhao, Wei Liu, Ninghao Liu, Sheng Li, Dajiang Zhu, Hongmin Cai, Quanzheng Li, Dinggang Shen, Tianming Liu, Xiang Li

Text data augmentation is an effective strategy for overcoming the challenge of limited sample sizes in many natural language processing (NLP) tasks. This challenge is especially prominent in the few-shot learning scenario, where the data in the target domain is generally much scarcer and of lower quality. A natural and widely used strategy to mitigate such challenges is to perform data augmentation on the training data to better capture the data invariance and increase the sample size. However, current text data augmentation methods either cannot ensure the correct labeling of the generated data (lacking faithfulness), cannot ensure sufficient diversity in the generated data (lacking completeness), or both. Inspired by the recent success of large language models, especially ChatGPT, which has demonstrated improved language comprehension abilities, we propose a text data augmentation approach based on ChatGPT (named ChatAug). ChatGPT is trained on data with unparalleled linguistic richness and employs a reinforcement training process with large-scale human feedback, which gives the model an affinity for the naturalness of human language. ChatAug rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples. The augmented samples can then be used in downstream model training. Experimental results on few-shot text classification tasks show that ChatAug outperforms state-of-the-art text data augmentation methods in terms of testing accuracy and the distribution of the augmented samples.
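The augmentation loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`augment_dataset`, `toy_rephrase`), the `n_per_sample` parameter, and the example sentences are all hypothetical, and `toy_rephrase` is a trivial template-based stand-in for a real call to a language model such as ChatGPT. The sketch only shows the assumed overall shape: each training sentence is rephrased several times, and every rephrasing inherits the label of its source sentence.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (text, label)


def augment_dataset(
    samples: List[Sample],
    rephrase_fn: Callable[[str, int], List[str]],
    n_per_sample: int = 3,
) -> List[Sample]:
    """Return the original samples plus label-preserving rephrasings.

    `rephrase_fn(text, n)` is a hypothetical hook that should return up to
    `n` rephrasings of `text`; in practice it would wrap a language model.
    """
    augmented = list(samples)  # keep the original few-shot samples
    for text, label in samples:
        seen = {text}  # avoid adding exact duplicates of the source
        for candidate in rephrase_fn(text, n_per_sample):
            candidate = candidate.strip()
            if candidate and candidate not in seen:
                seen.add(candidate)
                # The rephrasing inherits the label of its source sentence.
                augmented.append((candidate, label))
    return augmented


def toy_rephrase(text: str, n: int) -> List[str]:
    """Stand-in for a language model call (assumption, for illustration)."""
    templates = [
        "In other words, {}",
        "Put differently, {}",
        "That is to say, {}",
    ]
    return [t.format(text) for t in templates[:n]]


few_shot = [
    ("the battery drains too fast", "negative"),
    ("great screen and sound", "positive"),
]
augmented = augment_dataset(few_shot, toy_rephrase, n_per_sample=2)
```

Passing the rephrasing step in as a function keeps the loop independent of any particular model or API, so the stand-in can be swapped for a real model call without changing the rest of the pipeline. The resulting `augmented` list can then be fed to whatever downstream classifier is being trained.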
