ACLM: A Selective-Denoising based Generative Data Augmentation Approach for Low-Resource Complex NER

Complex Named Entity Recognition (NER) is the task of detecting linguistically complex named entities in low-context text. In this paper, we present ACLM (Attention-map aware keyword selection for Conditional Language Model fine-tuning), a novel data augmentation approach based on conditional generation to address the data scarcity problem in low-resource complex NER. ACLM alleviates the context-entity mismatch issue, a problem that existing NER data augmentation techniques suffer from and that often leads them to generate incoherent augmentations by placing complex named entities in the wrong context. ACLM builds on BART and is optimized on a novel text reconstruction or denoising task: we use selective masking (aided by attention maps) to retain the named entities and certain keywords in the input sentence that provide contextually relevant additional knowledge or hints about the named entities. Compared with other data augmentation strategies, ACLM can generate more diverse and coherent augmentations while preserving the true word sense of complex entities in the sentence. We demonstrate the effectiveness of ACLM both qualitatively and quantitatively on monolingual, cross-lingual, and multilingual complex NER across various low-resource settings. ACLM outperforms all our neural baselines by a significant margin (1%-36%). In addition, we demonstrate the application of ACLM to other domains that suffer from data scarcity (e.g., biomedical). In practice, ACLM generates more effective and factual augmentations for these domains than prior methods. Code: https://github.com/Sreyan88/ACLM
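The selective masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it assumes BIO-style tags, a per-token attention score that has already been computed by some NER model, and a hypothetical `keep_ratio` parameter controlling how many non-entity tokens survive as keywords. The function name `build_template` and the example scores are likewise assumptions made for illustration.

```python
from typing import List, Sequence

MASK = "<mask>"  # BART's mask token


def build_template(
    tokens: Sequence[str],
    tags: Sequence[str],
    scores: Sequence[float],
    keep_ratio: float = 0.3,  # hypothetical knob, not taken from the paper
) -> List[str]:
    """Keep entity tokens and the highest-scoring context tokens; mask the rest."""
    if not (len(tokens) == len(tags) == len(scores)):
        raise ValueError("tokens, tags and scores must have the same length")

    # Only non-entity ("O") tokens compete to be kept as keywords.
    context = [i for i, tag in enumerate(tags) if tag == "O"]
    n_keep = int(round(keep_ratio * len(context)))
    ranked = sorted(context, key=lambda i: scores[i], reverse=True)
    keywords = set(ranked[:n_keep])

    template: List[str] = []
    for i, token in enumerate(tokens):
        if tags[i] != "O" or i in keywords:
            template.append(token)  # entity token or selected keyword
        elif not template or template[-1] != MASK:
            template.append(MASK)  # merge runs of masked tokens into one
    return template


tokens = ["the", "film", "was", "directed", "by", "steven", "spielberg"]
tags = ["O", "O", "O", "O", "O", "B-PER", "I-PER"]
scores = [0.05, 0.6, 0.1, 0.9, 0.2, 0.0, 0.0]  # made-up attention scores
print(" ".join(build_template(tokens, tags, scores, keep_ratio=0.4)))
```

A template of this kind would serve as the corrupted input for a denoising objective, with the original sentence as the reconstruction target. Merging consecutive masked tokens into a single mask is a design choice of this sketch that leaves the generator free to decide how many tokens to fill in.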
