Deep learning-based text classification models need abundant labeled data to achieve competitive performance. Unfortunately, annotating a large corpus is time-consuming and laborious. To tackle this, many studies use data augmentation to expand the corpus. However, data augmentation can produce noisy augmented samples. Currently, no work explores sample selection for augmented samples in the natural language processing field. In this paper, we propose a novel self-training selection framework with two selectors that select high-quality samples from the output of data augmentation. Specifically, we first use an entropy-based strategy together with the model prediction to select augmented samples. Because some high-quality samples may be wrongly filtered out at this step, we then recall them from two perspectives: word overlap and semantic similarity. Experimental results show that our framework is both effective and simple.
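The two selectors described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything concrete here is an assumption made for illustration: the function names, the thresholds (`max_entropy`, `min_overlap`, `min_cosine`), the use of Jaccard overlap for word overlap and cosine similarity for semantic similarity, and the choice to recall a sample when either criterion is met. `predict_proba` and `embed` are hypothetical callables standing in for the classifier and a sentence encoder.

```python
import math
from typing import Callable, List, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (natural log) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def jaccard(a: str, b: str) -> float:
    """Word overlap between two texts as Jaccard similarity of token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_augmented(
    samples: List[Tuple[str, str, int]],            # (original, augmented, label)
    predict_proba: Callable[[str], Sequence[float]],  # hypothetical classifier
    embed: Callable[[str], Sequence[float]],          # hypothetical encoder
    max_entropy: float = 0.5,   # assumed threshold
    min_overlap: float = 0.6,   # assumed threshold
    min_cosine: float = 0.9,    # assumed threshold
) -> List[Tuple[str, int]]:
    kept = []
    for original, augmented, label in samples:
        probs = predict_proba(augmented)
        predicted = max(range(len(probs)), key=lambda i: probs[i])
        # Selector 1: low-entropy prediction that agrees with the label.
        if entropy(probs) <= max_entropy and predicted == label:
            kept.append((augmented, label))
            continue
        # Selector 2: recall filtered samples that stay close to the original,
        # by word overlap or by semantic similarity (OR is an assumption).
        if (jaccard(original, augmented) >= min_overlap
                or cosine(embed(original), embed(augmented)) >= min_cosine):
            kept.append((augmented, label))
    return kept
```

In a self-training loop, the kept samples would be added to the training set and the classifier retrained before the next selection round; that loop is omitted here.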