Named entity recognition (NER) aims to identify mentions of named entities in unstructured text and classify them into predefined entity classes. Although deep learning-based pre-trained language models achieve good predictive performance in NER, many domain-specific NER applications still require a substantial amount of labeled data. Active learning (AL), a general framework for the label acquisition problem, has been used in NER to minimize annotation cost without sacrificing model performance. However, the heavily imbalanced class distribution of tokens makes it challenging to design effective AL querying methods for NER. We propose several AL sentence query evaluation functions that pay more attention to potential positive tokens, and evaluate them under both sentence-based and token-based cost evaluation strategies. We also propose a better data-driven normalization approach that penalizes sentences that are too long or too short. Our experiments on three datasets from different domains show that the proposed approach reduces the number of annotated tokens while achieving prediction performance better than or comparable to that of conventional methods.
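To make the idea of a sentence query function that favors potential positive tokens more concrete, the following is a minimal sketch and not the functions proposed in this work. It assumes that class index 0 is the negative ("O") class, that token uncertainty is measured by entropy, and that sentence length is normalized against the median length of the unlabeled pool. The names `positive_weighted_score`, `length_penalty`, and `rank_sentences` are hypothetical and chosen only for illustration.

```python
import math
import statistics


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def positive_weighted_score(sentence_probs, outside_index=0):
    """Mean token entropy, with each token weighted by 1 - P(outside class).

    ASSUMPTION: `outside_index` is the negative ("O") class, so 1 - P(O)
    stands in for how likely a token is to be a positive (entity) token.
    """
    if not sentence_probs:
        return 0.0
    total = 0.0
    for probs in sentence_probs:
        p_positive = 1.0 - probs[outside_index]
        total += p_positive * token_entropy(probs)
    return total / len(sentence_probs)


def length_penalty(length, typical_length):
    """Hypothetical penalty: 1.0 at the typical length, smaller on either side."""
    if length <= 0 or typical_length <= 0:
        return 0.0
    ratio = length / typical_length
    return min(ratio, 1.0 / ratio)


def rank_sentences(pool):
    """Return sentence ids ordered from most to least worth annotating.

    `pool` maps a sentence id to a list of per-token probability vectors.
    The typical length is taken to be the median sentence length in the pool.
    """
    if not pool:
        return []
    typical = statistics.median(len(tokens) for tokens in pool.values())
    scores = {
        sid: positive_weighted_score(tokens) * length_penalty(len(tokens), typical)
        for sid, tokens in pool.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions, a sentence whose tokens are confidently predicted as the negative class scores close to zero, while a sentence containing uncertain tokens that are likely entities moves to the front of the query order. Sentences much shorter or longer than the pool median are scaled down in proportion to how far they deviate.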