Named entity recognition (NER) aims to identify mentions of named entities in unstructured text and classify them into predefined named entity classes. While deep learning-based pre-trained language models help achieve good predictive performance in NER, many domain-specific NER applications still call for a substantial amount of labeled data. Active learning (AL), a general framework for the label acquisition problem, has been used for NER tasks to minimize the annotation cost without sacrificing model performance. However, the heavily imbalanced class distribution of tokens makes it challenging to design effective AL querying methods for NER. We propose several AL sentence query evaluation functions that pay more attention to potential positive tokens, and evaluate these functions with both sentence-based and token-based cost evaluation strategies. We also propose a better data-driven normalization approach to penalize sentences that are too long or too short. Our experiments on three datasets from different domains reveal that the proposed approach reduces the number of annotated tokens while achieving prediction performance better than or comparable to that of conventional methods.