Existing models for named entity recognition (NER) rely mainly on large-scale labeled datasets, which are usually obtained through crowdsourcing. However, it is hard to obtain a unified and correct label via majority voting over multiple annotators for NER, due to the large labeling space and the complexity of the task. To address this problem, we utilize the original multi-annotator labels directly. Specifically, we propose a Confidence-based Partial Label Learning (CPLL) method that integrates the prior confidence (given by annotators) and the posterior confidence (learned by the model) for crowd-annotated NER. The model learns a token- and content-dependent confidence via an Expectation-Maximization (EM) algorithm by minimizing the empirical risk. A true posterior estimator and a confidence estimator run iteratively, updating the true posterior and the confidence, respectively. We conduct extensive experiments on both real-world and synthetic datasets, which show that our model effectively improves performance compared with strong baselines.
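As a rough illustration of how a prior confidence from annotators and a posterior confidence from a model might be combined for one sentence, consider the minimal sketch below. It is not the CPLL implementation: the function names, the vote-fraction prior, the renormalization over each token's candidate labels, and the linear mixing weight `alpha` are all assumptions made for this example.

```python
import numpy as np


def prior_confidence(votes, num_labels):
    """Fraction of annotators choosing each label, per token (assumed prior).

    votes: int array of shape (num_tokens, num_annotators) with label ids
    in the range [0, num_labels).
    """
    counts = np.zeros((votes.shape[0], num_labels))
    for token_idx, token_votes in enumerate(votes):
        counts[token_idx] = np.bincount(token_votes, minlength=num_labels)
    return counts / counts.sum(axis=1, keepdims=True)


def posterior_confidence(probs, candidate_mask):
    """Model probabilities renormalized over each token's candidate labels."""
    masked = probs * candidate_mask
    return masked / masked.sum(axis=1, keepdims=True)


def combine(prior, posterior, alpha=0.5):
    """Hypothetical linear mix of prior and posterior confidence."""
    return alpha * prior + (1.0 - alpha) * posterior


def confidence_weighted_nll(probs, confidence, eps=1e-12):
    """Empirical risk: negative log-likelihood weighted by label confidence."""
    return float(-(confidence * np.log(probs + eps)).sum(axis=1).mean())


# Two tokens, three annotators, three labels (toy data).
votes = np.array([[0, 0, 1],
                  [2, 2, 2]])
model_probs = np.array([[0.2, 0.6, 0.2],
                        [0.1, 0.1, 0.8]])

prior = prior_confidence(votes, num_labels=3)
candidates = prior > 0  # labels proposed by at least one annotator
posterior = posterior_confidence(model_probs, candidates)
confidence = combine(prior, posterior, alpha=0.5)
loss = confidence_weighted_nll(model_probs, confidence)
```

In an EM-style loop, one would alternate between recomputing `confidence` from the current model outputs and updating the model to reduce `loss`; the sketch shows only a single pass of that idea.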