We present ABEX, a novel and effective generative data augmentation methodology for low-resource Natural Language Understanding (NLU) tasks. ABEX is based on ABstract-and-EXpand, a novel paradigm for generating diverse forms of an input document -- we first convert a document into its concise, abstract description and then generate new documents by expanding the resultant abstraction. To learn the task of expanding abstract descriptions, we first train BART on a large-scale synthetic dataset of abstract-document pairs. Next, to generate abstract descriptions for a document, we propose a simple, controllable, and training-free method based on editing AMR graphs. ABEX brings the best of both worlds: by expanding from abstract representations, it preserves the original semantic properties of the documents, such as style and meaning, thereby maintaining alignment with the original label and data distribution. At the same time, the fundamental process of elaborating on abstract descriptions facilitates diverse generations. We demonstrate the effectiveness of ABEX on 4 NLU tasks spanning 12 datasets and 4 low-resource settings. Quantitatively, ABEX outperforms all our baselines with improvements of 0.04% - 38.8%. Qualitatively, ABEX outperforms all prior methods from the literature in terms of context and length diversity.
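The two-stage pipeline described above can be sketched as a minimal toy loop. Both stages here are illustrative stubs: in ABEX itself, `abstract` is realized by training-free AMR-graph editing and `expand` by a BART model fine-tuned on abstract-document pairs, neither of which is reproduced here.

```python
# Toy sketch of the ABstract-and-EXpand augmentation loop.
# Both steps are placeholder stubs, not the actual ABEX components:
# the real system abstracts via AMR-graph editing and expands with a
# BART model trained on synthetic abstract-document pairs.

def abstract(document: str) -> str:
    """Stub abstraction: keep the first clause as a crude 'abstract description'."""
    return document.split(",")[0].strip()

def expand(description: str, n: int = 3) -> list[str]:
    """Stub expansion: emit n template-based variants of the description.

    In ABEX this step is a learned BART generator, which is what
    produces the diverse, label-consistent documents.
    """
    templates = [
        "{d}.",
        "It is reported that {d}.",
        "{d}, according to the source document.",
    ]
    return [t.format(d=description) for t in templates[:n]]

def augment(document: str, n: int = 3) -> list[str]:
    """Abstract the input, then expand the abstraction into n new documents."""
    return expand(abstract(document), n)

if __name__ == "__main__":
    doc = "The service was excellent, and the staff were friendly throughout."
    for variant in augment(doc):
        print(variant)
```

The sketch only fixes the control flow: all generated variants share the semantics of the abstraction (preserving the label), while the expansion step is the sole source of surface diversity.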