Data scarcity and data imbalance have attracted considerable attention across many fields. Data augmentation, an effective approach to both problems, can improve the robustness and efficiency of classification models by generating new samples. This paper presents REPRINT, a simple and effective hidden-space data augmentation method for imbalanced data classification. Given hidden-space representations of samples in each class, REPRINT extrapolates, in a randomized fashion, augmented examples for the target class, using subspaces spanned by principal components to summarize the distribution structure of both the source and the target class. The generated examples therefore diversify the target class while preserving the original geometry of its distribution. In addition, the method includes a label refinement component that synthesizes new soft labels for the augmented examples. Compared with a range of NLP data augmentation approaches under various data imbalance scenarios on four text classification benchmarks, REPRINT shows prominent improvements. Comprehensive ablation studies further show that label refinement outperforms label preservation for augmented examples, and that the method yields stable, consistent improvements across suitable choices of principal components. Finally, REPRINT is appealing for its ease of use: it has only one hyperparameter, which determines the subspace dimension, and it requires little computational resource.
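To make the hidden-space extrapolation idea concrete, here is a minimal illustrative sketch. It is not the paper's exact algorithm (the abstract does not specify the update rule); it only assumes the ingredients stated above: class means, a principal-component subspace of dimension `k` (the single hyperparameter), and a randomized extrapolation step that keeps synthetic points aligned with the target class's geometry. The function name `reprint_like_augment` and the uniform sampling of the extrapolation strength are assumptions for illustration.

```python
import numpy as np

def reprint_like_augment(source, target, k=3, rng=None):
    """Illustrative PCA-subspace extrapolation, NOT the paper's exact method.

    source, target: (n, d) arrays of hidden-space representations for the
        source (majority) and target (minority) classes.
    k: dimension of the principal-component subspace (the one hyperparameter).
    Returns a single (d,) augmented example for the target class.
    """
    rng = np.random.default_rng(rng)
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)

    # Top-k principal directions of the target class summarize the
    # geometry of its distribution in hidden space.
    _, _, vt = np.linalg.svd(target - mu_t, full_matrices=False)
    p_t = vt[:k].T  # (d, k) orthonormal basis of the target subspace

    # Take a random source example's offset from its class mean and
    # project it onto the target subspace, so the synthetic point adds
    # diversity while respecting the target distribution's structure.
    x_s = source[rng.integers(len(source))]
    offset = p_t @ (p_t.T @ (x_s - mu_s))

    lam = rng.uniform(0.0, 1.0)  # randomized extrapolation strength (assumed)
    return mu_t + lam * offset
```

A soft label for the augmented point could then be refined, e.g. as a mixture of the source and target one-hot labels weighted by `lam`, in the spirit of the label refinement component described above.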