Data augmentation is widely used in low-resource named entity recognition (NER) to mitigate data sparsity. However, previous data augmentation methods suffer from disrupted syntactic structures, token-label mismatch, and a reliance on external knowledge or manual effort. To address these issues, we propose Robust Prompt-based Data Augmentation (RoPDA) for low-resource NER. Built on pre-trained language models (PLMs) with continuous prompts, RoPDA performs entity augmentation and context augmentation through five fundamental augmentation operations, generating both label-flipping and label-preserving examples. To make the best use of the augmented samples, we present two techniques: Self-Consistency Filtering and mixup. The former eliminates low-quality samples, while the latter prevents the performance degradation that arises from using label-flipping samples directly. Extensive experiments on three benchmarks from different domains demonstrate that RoPDA significantly improves upon strong baselines and, when unlabeled data is included, also outperforms state-of-the-art semi-supervised learning methods.