Distantly supervised named entity recognition (DS-NER) aims to locate entity mentions and classify their types using only knowledge bases or gazetteers and an unlabeled corpus. However, distant annotations are noisy and degrade the performance of NER models. In this paper, we propose a noise-robust prototype network named MProto for the DS-NER task. Unlike previous prototype-based NER methods, MProto represents each entity type with multiple prototypes to characterize the intra-class variance among entity representations. To optimize the classifier, each token should be assigned an appropriate ground-truth prototype, and we treat this token-prototype assignment as an optimal transport (OT) problem. Furthermore, to mitigate the noise from incomplete labeling, we propose a novel denoised optimal transport (DOT) algorithm. Specifically, we use the assignment result between Other-class tokens and all prototypes to distinguish unlabeled entity tokens from true negatives. Experiments on several DS-NER benchmarks demonstrate that MProto achieves state-of-the-art performance. The source code is now available on GitHub.
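To make the token-prototype assignment concrete, the following is a minimal sketch of an entropic-regularized OT assignment solved with Sinkhorn iterations. It is not the MProto or DOT implementation: the cosine cost, the uniform marginals, the regularization strength `eps`, the random toy embeddings, and the helper names `sinkhorn` and `cosine_cost` are all assumptions chosen for illustration.

```python
import numpy as np


def sinkhorn(cost, row_mass, col_mass, eps=0.1, n_iters=200):
    """Entropic-regularized OT via Sinkhorn iterations; returns the transport plan."""
    kernel = np.exp(-cost / eps)  # Gibbs kernel of the cost matrix
    u = np.ones_like(row_mass)
    v = np.ones_like(col_mass)
    for _ in range(n_iters):
        u = row_mass / (kernel @ v)    # rescale rows toward the token marginal
        v = col_mass / (kernel.T @ u)  # rescale columns toward the prototype marginal
    return u[:, None] * kernel * v[None, :]


def cosine_cost(tokens, prototypes):
    """Cost = 1 - cosine similarity between every token and every prototype."""
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return 1.0 - t @ p.T


# Toy data (assumed): 6 token embeddings of one entity type, 3 prototypes for that type.
rng = np.random.default_rng(0)
num_tokens, num_prototypes, dim = 6, 3, 8
tokens = rng.normal(size=(num_tokens, dim))
prototypes = rng.normal(size=(num_prototypes, dim))

cost = cosine_cost(tokens, prototypes)
row_mass = np.full(num_tokens, 1.0 / num_tokens)          # each token carries equal mass
col_mass = np.full(num_prototypes, 1.0 / num_prototypes)  # each prototype receives equal mass

plan = sinkhorn(cost, row_mass, col_mass)
assignment = plan.argmax(axis=1)  # prototype receiving the most mass from each token
print(assignment)
```

The uniform prototype marginal discourages all tokens from collapsing onto a single prototype, which is one common motivation for OT-based assignment. The abstract describes applying an assignment between Other-class tokens and all prototypes for denoising; the same `sinkhorn` helper could serve as a starting point there, but the marginals and the decision rule used by DOT are not reproduced in this sketch.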