Text augmentation is a technique for constructing synthetic data from an under-resourced corpus to improve predictive performance. Synthetic data generation is common in numerous domains, and text augmentation has recently emerged in natural language processing (NLP) as a way to improve performance on downstream tasks. One of the current state-of-the-art text augmentation techniques is easy data augmentation (EDA), which augments the training data by injecting and replacing synonyms and randomly permuting sentences. One major obstacle with EDA is the need for versatile and complete synonym dictionaries, which cannot be easily found for low-resource languages. To improve the utility of EDA, we propose two extensions, easy distributional data augmentation (EDDA) and type specific similar word replacement (TSSR), which use semantic word context information and part-of-speech tags for word replacement and augmentation. In an extensive empirical evaluation, we show the utility of the proposed methods, measured by F1 score, on two representative datasets in Swedish as an example of a low-resource language. With the proposed methods, we show that augmented data improve classification performance in low-resource settings.
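To make the operations named in the abstract concrete, the following is a minimal sketch of EDA-style augmentation: replacing words with similar words, injecting similar words, and randomly swapping tokens. It is not the authors' implementation. The function names, the `SIMILAR` lookup table, and its Swedish entries are hypothetical and chosen only for illustration; in an EDA setting the lookup would come from a synonym dictionary, whereas an EDDA-style variant would presumably fill it with nearest neighbours from a word-embedding model.

```python
import random

# Hypothetical toy lookup (illustration only). A real system would build this
# from a synonym dictionary or from embedding-space nearest neighbours.
SIMILAR = {
    "bra": ["fin", "utmärkt"],
    "film": ["rulle"],
}


def replace_similar(tokens, n, rng, similar=SIMILAR):
    """Replace up to n tokens that have an entry in the lookup."""
    out = list(tokens)
    candidates = [i for i, tok in enumerate(out) if similar.get(tok)]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(similar[out[i]])
    return out


def insert_similar(tokens, n, rng, similar=SIMILAR):
    """Insert up to n similar words at random positions."""
    out = list(tokens)
    candidates = [tok for tok in tokens if similar.get(tok)]
    for _ in range(n):
        if not candidates:
            break
        word = rng.choice(similar[rng.choice(candidates)])
        # randint is inclusive on both ends, so the word may also be appended.
        out.insert(rng.randint(0, len(out)), word)
    return out


def random_swap(tokens, n, rng, similar=None):
    """Swap two randomly chosen token positions, n times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def augment(sentence, n_aug=4, n_ops=1, seed=0):
    """Return n_aug augmented variants of a whitespace-tokenised sentence."""
    rng = random.Random(seed)
    tokens = sentence.split()
    ops = [replace_similar, insert_similar, random_swap]
    variants = []
    for _ in range(n_aug):
        op = rng.choice(ops)
        variants.append(" ".join(op(tokens, n_ops, rng)))
    return variants


if __name__ == "__main__":
    for variant in augment("det var en bra film"):
        print(variant)
```

A TSSR-style variant would, under the same assumptions, additionally filter the replacement candidates so that a word is only replaced by a word with the same part-of-speech tag; that step is omitted here because it requires a tagger for the target language.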