Despite recent advancements in machine learning, many tasks still involve working in low-data regimes, which can make solving natural language problems difficult. Recently, a number of text augmentation techniques have emerged in the field of Natural Language Processing (NLP) that enrich the training data with new examples, though they are not without their caveats. For instance, simple rule-based heuristic methods are effective, but the examples they produce vary little from the original text in semantic content and syntactic structure. On the other hand, more complex deep learning approaches can cause extreme shifts in the intrinsic meaning of the text and introduce unwanted noise into the training data. To more reliably control the quality of the augmented examples, we introduce a state-of-the-art approach for Self-Controlled Text Augmentation (STA). Our approach tightly controls the generation process by introducing a self-checking procedure to ensure that generated examples retain the semantic content of the original text. Experimental results on multiple benchmarking datasets demonstrate that STA substantially outperforms existing state-of-the-art techniques, whilst qualitative analysis reveals that the generated examples are both lexically diverse and semantically reliable.
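The abstract does not say how the self-checking procedure is implemented, so the following is only a minimal sketch of the general generate-then-check idea, not the STA method itself. All names here (`jaccard`, `self_checked_augment`, `toy_generate`, the `threshold` parameter) are hypothetical, and the token-overlap score is an assumed, deliberately crude stand-in for whatever semantic check the approach actually uses.

```python
from typing import Callable, Iterable, List


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap; an assumed stand-in for a semantic check."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def self_checked_augment(
    text: str,
    generate: Callable[[str], Iterable[str]],
    check: Callable[[str, str], float] = jaccard,
    threshold: float = 0.5,
) -> List[str]:
    """Return generated candidates whose check score against `text` meets the threshold."""
    kept: List[str] = []
    for cand in generate(text):
        # Skip exact copies and duplicates, then apply the check.
        if cand != text and cand not in kept and check(text, cand) >= threshold:
            kept.append(cand)
    return kept


def toy_generate(text: str) -> List[str]:
    # Hypothetical generator: one close rewording and one off-topic sentence.
    return [text.replace("great", "excellent"), "the weather is cold today"]


if __name__ == "__main__":
    print(self_checked_augment("the movie was great", toy_generate))
```

In this toy run the reworded candidate is kept and the off-topic one is filtered out. In practice the `generate` and `check` callables would presumably be learned models; raising `threshold` trades lexical diversity for closer agreement with the original text.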