Biomedical Named Entity Recognition (BioNER) is the fundamental task of identifying named entities in biomedical text. However, BioNER suffers from severe data scarcity and lacks high-quality labeled data because annotation requires highly specialized expert knowledge. Though data augmentation has been shown to be highly effective for low-resource NER in general, existing data augmentation techniques fail to produce factual and diverse augmentations for BioNER. In this paper, we present BioAug, a novel data augmentation framework for low-resource BioNER. BioAug, built on BART, is trained to solve a novel text reconstruction task based on selective masking and knowledge augmentation. After training, we perform conditional generation, producing diverse augmentations by conditioning BioAug on selectively corrupted text similar to that used in the training stage. We demonstrate the effectiveness of BioAug on 5 benchmark BioNER datasets and show that it outperforms all our baselines by a significant margin (1.5%-21.5% absolute improvement) and generates augmentations that are both more factual and more diverse. Code: https://github.com/Sreyan88/BioAug.
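To make the idea of selective masking concrete, the following is a minimal sketch of one way a sentence could be corrupted before being passed to a reconstruction model. It is not taken from the BioAug repository. The function name `selectively_mask`, the `keep_prob` parameter, the BIO tag scheme, and the rule of always keeping entity tokens while merging adjacent masks into a single `<mask>` are all assumptions made for illustration. The knowledge augmentation step is not shown.

```python
import random

MASK = "<mask>"  # assumed mask symbol, chosen to match BART's mask token


def selectively_mask(tokens, tags, keep_prob=0.3, seed=None):
    """Corrupt a sentence while preserving its named entities.

    Hypothetical helper: entity tokens (any tag other than "O") are always
    kept, every other token is kept with probability `keep_prob`, and runs
    of masked tokens are merged into a single MASK symbol.
    """
    rng = random.Random(seed)
    corrupted = []
    for token, tag in zip(tokens, tags):
        is_entity = tag != "O"
        if is_entity or rng.random() < keep_prob:
            corrupted.append(token)
        elif not corrupted or corrupted[-1] != MASK:
            # Start a new masked span; later masked tokens join this span.
            corrupted.append(MASK)
    return corrupted


tokens = ["Mutations", "in", "BRCA1", "increase", "breast", "cancer", "risk"]
tags = ["O", "O", "B-Gene", "O", "B-Disease", "I-Disease", "O"]

# With keep_prob=0.0 every non-entity token is masked, so only entities remain.
print(selectively_mask(tokens, tags, keep_prob=0.0))
# ['<mask>', 'BRCA1', '<mask>', 'breast', 'cancer', '<mask>']
```

Under these assumptions, a sequence-to-sequence model such as BART would be trained to reconstruct the original sentence from the corrupted one, and sampling from the same kind of corrupted input at generation time would yield new sentences that retain the original entities.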