Text data augmentation is an effective strategy for overcoming limited sample sizes in many natural language processing (NLP) tasks. The problem is especially acute in few-shot learning, where data in the target domain is generally much scarcer and of lower quality. A natural and widely used remedy is data augmentation, which helps a model capture the invariances in the data and increases the sample size. However, current text data augmentation methods either cannot ensure that the generated data is correctly labeled (lacking faithfulness), cannot ensure sufficient diversity in the generated data (lacking compactness), or both. Inspired by the recent success of large language models, especially ChatGPT and its improved language comprehension abilities, we propose AugGPT, a text data augmentation approach based on ChatGPT. AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples, which can then be used in downstream model training. Experimental results on few-shot text classification tasks show that AugGPT outperforms state-of-the-art text data augmentation methods in terms of testing accuracy and the distribution of the augmented samples.
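The rephrase-then-train idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `augment`, `toy_rephrase`, and `n_variants` are hypothetical, and the template-based rephraser is only a stand-in for a call to ChatGPT so that the example runs offline. The one property it is meant to show is that every rephrased variant inherits the label of its source sentence.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (sentence, label)


def augment(
    samples: List[Sample],
    rephrase: Callable[[str, int], List[str]],
    n_variants: int = 3,
) -> List[Sample]:
    """Return the original samples plus label-preserving rephrasings.

    `rephrase` is a hypothetical hook: in practice it would query a large
    language model; here any function (text, n) -> list of strings works.
    """
    augmented: List[Sample] = []
    for text, label in samples:
        augmented.append((text, label))  # keep the original sample
        seen = {text}
        for variant in rephrase(text, n_variants):
            if variant not in seen:  # drop exact duplicates
                seen.add(variant)
                augmented.append((variant, label))  # variant inherits the label
    return augmented


def toy_rephrase(text: str, n: int) -> List[str]:
    """Stand-in rephraser (assumption): wraps the text in fixed templates."""
    templates = ["In other words, {}", "Put differently, {}", "That is to say, {}"]
    return [t.format(text) for t in templates[:n]]


few_shot = [
    ("the battery drains quickly", "negative"),
    ("the screen is sharp and bright", "positive"),
]
train_set = augment(few_shot, toy_rephrase, n_variants=2)
```

The resulting `train_set` could then be passed to any downstream classifier in place of the original few-shot set. Replacing `toy_rephrase` with a model-backed function would leave `augment` unchanged.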