This study conducts a thorough evaluation of text augmentation techniques across a variety of datasets and natural language processing (NLP) tasks to address the lack of reliable, generalized evidence for these methods. It examines how effectively these techniques expand training sets to improve performance on tasks such as topic classification, sentiment analysis, and offensive language detection. The research emphasizes not only the augmentation methods themselves, but also the strategic order in which real and augmented instances are introduced during training. A major contribution is the development and evaluation of Modified Cyclical Curriculum Learning (MCCL) for augmented datasets, a novel approach in the field. The results show that specific augmentation methods, especially when integrated with MCCL, significantly outperform traditional training approaches in NLP model performance. These findings underscore the need for careful selection of augmentation techniques and sequencing strategies to balance training speed against quality gains across NLP tasks. The study concludes that augmentation methods, particularly in conjunction with MCCL, improve results on a range of classification tasks, providing a foundation for future advances in text augmentation strategies in NLP.