Text data augmentation is a complex problem due to the discrete nature of sentences. Rule-based augmentation methods are widely adopted in real-world applications because of their simplicity, but they can damage the semantics of the original text. Previous work proposed easy data augmentation with soft labels (softEDA), which employs label smoothing to mitigate this problem. However, finding the best smoothing factor for each model and dataset is challenging, so softEDA remains difficult to use in real-world applications. In this paper, we propose adapting AutoAugment to solve this problem. The experimental results suggest that the proposed method can boost existing augmentation methods and that rule-based methods can enhance cutting-edge pre-trained language models. The source code is publicly available.
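To make the role of the smoothing factor concrete, the following is a minimal sketch of label smoothing applied to augmented examples. It assumes the standard uniform formulation, in which a one-hot label is mixed with a uniform distribution over classes, and it assumes that only augmented samples receive soft labels while original samples keep hard labels. The function names `smooth_label` and `make_target` are hypothetical and are not taken from the released source code.

```python
from typing import List


def smooth_label(label: int, num_classes: int, factor: float) -> List[float]:
    """Uniform label smoothing: (1 - factor) * one_hot + factor / num_classes.

    `factor` is the smoothing factor that has to be tuned per model and dataset.
    """
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must be in [0, 1]")
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    off_value = factor / num_classes
    return [
        (1.0 - factor) + off_value if i == label else off_value
        for i in range(num_classes)
    ]


def make_target(label: int, num_classes: int, is_augmented: bool, factor: float) -> List[float]:
    """Assumption: originals keep hard labels, augmented samples get soft labels."""
    return smooth_label(label, num_classes, factor if is_augmented else 0.0)


if __name__ == "__main__":
    # Original sentence, binary task: hard label.
    print(make_target(1, 2, is_augmented=False, factor=0.2))
    # Augmented sentence: some probability mass moves to the other class.
    print(make_target(1, 2, is_augmented=True, factor=0.2))
```

A larger factor expresses less trust in the label of an augmented sentence, which is why a single fixed value is unlikely to suit every model and dataset.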