Text classification tasks often encounter few-shot scenarios with limited labeled data, so addressing data scarcity is crucial. Data augmentation with mixup has been shown to be effective on various text classification tasks. However, most mixup methods do not account for how learning difficulty varies across training stages, and they generate new samples with one-hot labels, which makes the model over-confident. In this paper, we propose a self-evolution learning (SE) based mixup approach for data augmentation in text classification, which generates more adaptive and model-friendly pseudo samples for training. SE focuses on the variation of the model's learning ability. To alleviate over-confidence, we introduce a novel instance-specific label smoothing approach, which linearly interpolates the model's output and the one-hot labels of the original samples to generate new soft labels for label mixup. Experimental analysis shows that, in addition to improving classification accuracy, SE also enhances the model's generalization ability.
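The instance-specific label smoothing and label mixup described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the interpolation weight is chosen or which representations are mixed, so the weight `alpha`, the mixup ratio `lam`, the helper names, and the use of plain feature vectors are all assumptions made for illustration.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Turn integer class ids into one-hot rows."""
    return np.eye(num_classes)[labels]


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def instance_soft_labels(probs, hard, alpha):
    """Linearly interpolate one-hot labels with the model's own output.

    alpha is a hypothetical weight on the one-hot label; alpha = 1.0
    recovers the plain one-hot target.
    """
    return alpha * hard + (1.0 - alpha) * probs


def mixup(x_a, x_b, y_a, y_b, lam):
    """Standard mixup: the same convex combination for features and labels."""
    x_mix = lam * x_a + (1.0 - lam) * x_b
    y_mix = lam * y_a + (1.0 - lam) * y_b
    return x_mix, y_mix


# Toy example: two samples, three classes (all values are made up).
logits = np.array([[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]])
labels = np.array([0, 2])
features = np.array([[1.0, 0.0], [0.0, 1.0]])  # stand-ins for text representations

probs = softmax(logits)
hard = one_hot(labels, num_classes=3)
soft = instance_soft_labels(probs, hard, alpha=0.7)

# Mix the two samples using their soft labels instead of one-hot labels.
x_mix, y_mix = mixup(features[0], features[1], soft[0], soft[1], lam=0.25)
print(x_mix, y_mix)
```

Because both the soft labels and the mixed label are convex combinations of probability vectors, each remains a valid distribution. Unlike uniform label smoothing, the smoothing here differs per instance, since it depends on what the model currently predicts for that sample.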