Data augmentation is widely used in text classification, especially in the low-resource regime, where only a few examples per class are available during training. Despite this success, generating augmentations that act as hard positive examples, which may increase their effectiveness, remains under-explored. This paper proposes Adversarial Word Dilution (AWD), a method that generates hard positive examples as text data augmentations to train low-resource text classification models efficiently. Our idea is to dilute the embeddings of strong positive words by weighted mixing with the unknown-word embedding, making the augmented inputs hard for the classification model to recognize as positive. We learn the dilution weights adversarially through a constrained min-max optimization process guided by the labels. Empirical studies on three benchmark datasets show that AWD generates more effective data augmentations and outperforms state-of-the-art text data augmentation methods. Additional analysis demonstrates that the augmentations generated by AWD are interpretable and can flexibly extend to new examples without further training.
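The dilution step described in the abstract can be illustrated with a minimal sketch. It assumes the weighted mixing is a convex combination of each word embedding with the unknown-word embedding, with one weight per word in [0, 1]; the abstract does not state the exact form or the constraints. The names `dilute`, `dilute_sentence`, and the toy vectors are hypothetical, and the adversarial min-max learning of the weights is not shown: the weights here are fixed by hand.

```python
from typing import List

Vector = List[float]


def dilute(word_vec: Vector, unk_vec: Vector, weight: float) -> Vector:
    """Mix one word embedding with the unknown-word embedding.

    Assumed form (not confirmed by the abstract):
        diluted = (1 - weight) * word_vec + weight * unk_vec
    weight = 0.0 leaves the word unchanged; weight = 1.0 replaces it
    entirely with the unknown-word embedding.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(word_vec) != len(unk_vec):
        raise ValueError("embedding dimensions must match")
    return [(1.0 - weight) * w + weight * u for w, u in zip(word_vec, unk_vec)]


def dilute_sentence(
    word_vecs: List[Vector], unk_vec: Vector, weights: List[float]
) -> List[Vector]:
    """Apply a separate dilution weight to each word embedding in a sentence."""
    if len(word_vecs) != len(weights):
        raise ValueError("need exactly one weight per word")
    return [dilute(vec, unk_vec, w) for vec, w in zip(word_vecs, weights)]


# Toy 2-dimensional embeddings for a three-word sentence (illustrative values).
sentence = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
unk = [0.0, 0.0]

# Hand-picked weights; in AWD these would be learned adversarially.
weights = [0.0, 0.5, 1.0]

diluted = dilute_sentence(sentence, unk, weights)
print(diluted)
```

In this toy run the first word is untouched, the second moves halfway toward the unknown-word embedding, and the third is fully replaced by it. Per-word weights of this kind are also easy to inspect, since a large weight marks a word that has been strongly diluted.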