Data Augmentation (DA) is frequently used to automatically provide additional training data without extra human annotation. However, data augmentation may introduce noisy data that impairs training. To guarantee the quality of augmented data, existing methods either assume that the augmented data contains no noise and adopt consistency training, or use simple heuristics such as training loss and diversity constraints to filter out "noisy" data. Yet the filtered examples may still contain useful information, and dropping them completely causes a loss of supervision signals. In this paper, based on the assumption that the original dataset is cleaner than the augmented data, we propose an on-the-fly denoising technique for data augmentation that learns from soft augmented labels provided by an organic teacher model trained on the cleaner original data. To further prevent overfitting on noisy labels, a simple self-regularization module is applied to force the model's predictions to be consistent across two distinct dropouts. Our method can be applied to general augmentation techniques and consistently improves performance on both text classification and question-answering tasks.
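The two ingredients named in the abstract, soft labels from a teacher trained on the original data and a consistency term across two dropouts, can be illustrated with a minimal per-example loss. This is only a sketch under stated assumptions: the abstract does not give the actual objective, so the function name `denoising_loss`, the weights `alpha` and `beta`, the `temperature` parameter, and the way the three terms are combined are all hypothetical. The two student logit vectors stand in for two forward passes of the same augmented example under different dropout masks.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def denoising_loss(student_logits_a, student_logits_b, teacher_logits,
                   hard_label, alpha=0.5, beta=1.0, temperature=1.0):
    """Hypothetical loss for one augmented example.

    student_logits_a / student_logits_b: outputs of two forward passes of the
        student on the same input with different dropout masks (assumed).
    teacher_logits: output of a teacher trained on the original data only.
    hard_label: the (possibly noisy) label attached to the augmented example.
    alpha, beta, temperature: illustrative hyperparameters, not from the paper.
    """
    p_a = softmax(student_logits_a)
    p_b = softmax(student_logits_b)
    p_t = softmax(teacher_logits, temperature)

    # Cross-entropy against the augmented (possibly noisy) hard label.
    ce = -0.5 * (math.log(p_a[hard_label]) + math.log(p_b[hard_label]))

    # Distillation term: pull both student passes toward the teacher's soft label.
    distill = 0.5 * (kl_divergence(p_t, p_a) + kl_divergence(p_t, p_b))

    # Self-regularization: symmetric KL between the two dropout passes.
    consistency = 0.5 * (kl_divergence(p_a, p_b) + kl_divergence(p_b, p_a))

    return (1.0 - alpha) * ce + alpha * distill + beta * consistency


if __name__ == "__main__":
    # Made-up logits for a 3-class example whose augmented label is class 0.
    loss = denoising_loss([2.0, 0.1, -1.0], [1.5, 0.4, -0.8],
                          [0.2, 1.9, -1.0], hard_label=0)
    print(f"loss = {loss:.4f}")
```

In this sketch, `alpha` controls how much the student trusts the teacher's soft label over the augmented hard label, so a noisy example is down-weighted rather than discarded, while `beta` scales the penalty for disagreement between the two dropout passes.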