Emotion recognition in text, the task of identifying emotions such as joy or anger, is a challenging NLP problem with many applications. One challenge is the shortage of datasets annotated with emotions. Some existing datasets are small, follow different emotion taxonomies, and have imbalanced emotion distributions. In this work, we studied the impact of data augmentation techniques specifically when applied to small, imbalanced datasets, on which current state-of-the-art models (such as RoBERTa) underperform. We applied four data augmentation methods, namely Easy Data Augmentation (EDA), static embedding-based augmentation, contextual embedding-based augmentation, and ProtAugment, to three datasets that come from different sources and vary in size, emotion categories, and emotion distributions. Our experimental results show that using the augmented data when training the classifier leads to significant improvements. Finally, we conducted two case studies: (a) using the ChatGPT API directly to paraphrase text with different prompts, and (b) using external data to augment the training set. The results show that both approaches are promising.