Data scarcity arises in languages and tasks for which large amounts of labeled data are unavailable but state-of-the-art models are still desired. Such models are often deep learning models that require a significant amount of data to train, and acquiring that data typically comes with high labeling costs. Data augmentation is a low-cost approach to tackling data scarcity. This paper gives an overview of current state-of-the-art data augmentation methods for natural language processing, with an emphasis on methods for neural and transformer-based models. Furthermore, it discusses the practical challenges of data augmentation, possible mitigations, and directions for future research.