KnowDA: All-in-One Knowledge Mixture Model for Data Augmentation in Low-Resource NLP

This paper focuses on data augmentation for low-resource NLP tasks, where the training set is limited. Existing solutions either apply task-independent heuristic rules (e.g., Synonym Replacement) or fine-tune general-purpose pre-trained language models (e.g., GPT-2) on the limited training instances to produce new synthetic data. Consequently, they carry little task-specific knowledge and tend to yield low-quality synthetic data. To combat this issue, we propose the Knowledge Mixture Data Augmentation Model (KnowDA), a Seq2Seq language model pre-trained on a mixture of diverse NLP tasks under a novel framework called Knowledge Mixture Training (KoMT). The goal of KoMT is to condense diverse task-specific NLP knowledge into a single KnowDA model (i.e., all-in-one), so that KnowDA can use this knowledge to quickly grasp the inherent synthesis law of the target task from limited training instances. Specifically, KoMT reformulates input examples from heterogeneous NLP tasks into a unified text-to-text format and employs denoising training objectives at different granularities to learn to reconstruct partial or complete samples. To the best of our knowledge, this is the first attempt to apply multi-task training over 100+ NLP tasks to data augmentation. Extensive experiments show that i) the synthetic data produced by KnowDA improves the performance of strong pre-trained language models (i.e., BERT, ALBERT, and DeBERTa) by a large margin on the low-resource NLP benchmarks FewGLUE, CoNLL'03, and WikiAnn; and ii) KnowDA successfully transfers task knowledge to NLP tasks whose types are both seen and unseen in KoMT.
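To make the idea of a unified text-to-text format with denoising at different granularities more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical key-value serialization (`[key] value`) and a T5-style sentinel token `<extra_id_0>`; neither is specified in the abstract, and the function names `serialize`, `mask_field`, and `mask_span` are illustrative only. Masking a whole field stands in for a coarse-grained objective, and masking a short token span stands in for a fine-grained one.

```python
import random


def serialize(example: dict) -> str:
    """Flatten a task example into one string of "[key] value" pairs.

    The "[key] value" layout is an assumption made for illustration.
    """
    return " ".join(f"[{key}] {value}" for key, value in example.items())


def mask_field(example: dict, field: str, sentinel: str = "<extra_id_0>"):
    """Coarse-grained denoising: hide the entire value of one field."""
    corrupted = dict(example)
    corrupted[field] = sentinel
    source = serialize(corrupted)
    target = f"{sentinel} {example[field]}"
    return source, target


def mask_span(example: dict, field: str, span_len: int,
              rng: random.Random, sentinel: str = "<extra_id_0>"):
    """Fine-grained denoising: hide a contiguous token span inside one field."""
    tokens = str(example[field]).split()
    span_len = min(span_len, len(tokens))
    # Pick a start index so that the span fits inside the field.
    start = rng.randrange(len(tokens) - span_len + 1)
    hidden = tokens[start:start + span_len]
    kept = tokens[:start] + [sentinel] + tokens[start + span_len:]
    corrupted = dict(example)
    corrupted[field] = " ".join(kept)
    return serialize(corrupted), f"{sentinel} {' '.join(hidden)}"


if __name__ == "__main__":
    # A hypothetical NLI-style example; any task with named fields would do.
    example = {
        "premise": "A man is playing a guitar",
        "hypothesis": "A person makes music",
        "label": "entailment",
    }
    print(mask_field(example, "hypothesis"))
    print(mask_span(example, "premise", span_len=2, rng=random.Random(0)))
```

Each function returns a (source, target) pair that a Seq2Seq model could be trained on: the source is the corrupted example and the target is the hidden content. Because every task is serialized the same way, the same two functions apply to classification, NLI, or tagging examples alike, which is the property the abstract attributes to the unified format.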
