Data augmentation (DA) methods have recently proven effective for pre-trained language models (PLMs) in low-resource settings, including few-shot named entity recognition (NER). However, conventional DA methods for NER mostly target sequence labeling models, i.e., token-level classification, and few are compatible with unified autoregressive generation frameworks, which can handle a wider range of NER tasks such as nested NER. Furthermore, these generation frameworks strongly assume that entities appear in the target sequence in the same left-to-right order as in the source sequence. In this paper, we argue that this strict order is unnecessary, and that more diverse yet reasonable target entity sequences can be provided during training as a novel DA method. Nevertheless, naively mixing the augmented data can confuse the model, since one source sequence is then paired with several different target sequences. We therefore propose a simple but effective Prompt Ordering based Data Augmentation (PODA) method to improve the training of unified autoregressive generation frameworks in few-shot NER scenarios. Experimental results on three public NER datasets, together with further analyses, demonstrate the effectiveness of our approach.
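
To make the idea concrete, the following is a minimal sketch of how reordered target sequences could be paired with ordering prompts so that each model input still maps to exactly one target. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the function names (`linearize`, `prompt_ordering_augment`), the `"order: ... | source"` prompt format, the `"span is LABEL"` target format, and the choice to permute entity types (rather than individual entities) are all hypothetical.

```python
from itertools import permutations


def linearize(entities):
    """Hypothetical target format: 'span is LABEL ; span is LABEL ; ...'."""
    return " ; ".join(f"{span} is {label}" for span, label in entities)


def prompt_ordering_augment(source, entities, labels, max_orders=None):
    """Build (prompted_source, target) pairs, one per ordering of entity types.

    Assumption: the prompt states the order in which entity types should be
    generated, so the same source sentence never has two different targets
    for an identical model input.
    """
    examples = []
    for k, label_order in enumerate(permutations(labels)):
        if max_orders is not None and k >= max_orders:
            break
        rank = {label: i for i, label in enumerate(label_order)}
        # sorted() is stable, so entities sharing a type keep their
        # original left-to-right order from the source sentence.
        ordered = sorted(entities, key=lambda e: rank[e[1]])
        prompt = "order: " + " , ".join(label_order)  # hypothetical prompt format
        examples.append((f"{prompt} | {source}", linearize(ordered)))
    return examples


examples = prompt_ordering_augment(
    "Alice joined Google in Paris",
    [("Alice", "PER"), ("Google", "ORG"), ("Paris", "LOC")],
    labels=["PER", "ORG", "LOC"],
)
for model_input, target in examples:
    print(model_input, "->", target)
```

With three entity types, the sketch yields six training pairs from one annotated sentence. Because the ordering prompt is part of the input, every input string is distinct, which is the property that avoids the one-source-to-many-targets confusion described above.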