ChatAug: Leveraging ChatGPT for Text Data Augmentation

Haixing Dai, Zhengliang Liu, Wenxiong Liao, Xiaoke Huang, Zihao Wu, Lin Zhao, Wei Liu, Ninghao Liu, Sheng Li, Dajiang Zhu, Hongmin Cai, Quanzheng Li, Dinggang Shen, Tianming Liu, Xiang Li

Text data augmentation is an effective strategy for overcoming limited sample sizes in many natural language processing (NLP) tasks. The problem is especially prominent in few-shot learning, where data in the target domain is generally much scarcer and of lower quality. A natural and widely used remedy is to augment the training data to better capture data invariance and increase the sample size. However, current text data augmentation methods cannot ensure the correct labeling of the generated data (lacking faithfulness), cannot ensure sufficient diversity in the generated data (lacking completeness), or both. Inspired by the recent success of large language models, especially ChatGPT, which has demonstrated improved language comprehension abilities, we propose a text data augmentation approach based on ChatGPT, named ChatAug. ChatGPT is trained on data with unparalleled linguistic richness and employs a reinforcement training process with large-scale human feedback, which endows the model with an affinity for the naturalness of human language. ChatAug rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples, which can then be used in downstream model training. Experimental results on few-shot text classification tasks show that ChatAug outperforms state-of-the-art text data augmentation methods in terms of testing accuracy and the distribution of the augmented samples.
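
The rephrase-then-train step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `augment_dataset`, `toy_paraphrase`, and the `n_variants` parameter are hypothetical names introduced here, and the abstract does not state the prompt or the number of rephrasings per sentence. The paraphraser is passed in as a plain function so that a real ChatGPT call could be substituted; the toy version below only prepends fixed phrases and is there to keep the example runnable offline.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text, label)


def augment_dataset(
    samples: List[Example],
    paraphrase: Callable[[str, int], List[str]],
    n_variants: int = 3,
) -> List[Example]:
    """Return the original samples plus label-preserving rephrasings.

    `paraphrase(text, n)` is an assumed interface: it should return up to
    `n` rewordings of `text`. Each rewording inherits the label of the
    sentence it came from.
    """
    augmented: List[Example] = []
    for text, label in samples:
        augmented.append((text, label))  # keep the original sample
        seen = {text}
        for candidate in paraphrase(text, n_variants):
            candidate = candidate.strip()
            # Skip empty outputs and exact repeats of text already kept.
            if candidate and candidate not in seen:
                seen.add(candidate)
                augmented.append((candidate, label))
    return augmented


def toy_paraphrase(text: str, n: int) -> List[str]:
    """Hypothetical stand-in for a language-model paraphraser."""
    templates = ["In other words, {}", "Put differently, {}", "That is to say, {}"]
    return [template.format(text) for template in templates[:n]]


few_shot = [("the treatment reduced symptoms", "positive")]
train_set = augment_dataset(few_shot, toy_paraphrase, n_variants=2)
```

The resulting `train_set` can be fed to any downstream classifier. Whether the inherited labels remain correct (faithfulness) and whether the rephrasings differ enough from one another (completeness) depends entirely on the paraphraser, which is the role the abstract assigns to ChatGPT.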
