Most weakly supervised named entity recognition (NER) models rely on domain-specific dictionaries provided by experts, an approach that is infeasible in the many domains where no such dictionaries exist. A recent study used a phrase retrieval model to construct pseudo-dictionaries automatically from entities retrieved from Wikipedia, but these dictionaries often have limited coverage because the retriever tends to return popular entities rather than rare ones. In this study, we present HighGEN, a novel framework that generates NER datasets with high-coverage pseudo-dictionaries. Specifically, we create entity-rich dictionaries with a novel search method, called phrase embedding search, which encourages the retriever to search a space densely populated with various entities. In addition, we use a new verification process based on the embedding distance between candidate entity mentions and entity types to reduce the false-positive noise in weak labels generated by high-coverage dictionaries. We demonstrate that HighGEN outperforms the previous best model by an average of 4.7 F1 points across five NER benchmark datasets.
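To make the verification idea concrete, the following is a minimal sketch of filtering weak labels by embedding distance. It is not the HighGEN implementation: the function names (`cosine_similarity`, `verify_weak_labels`), the use of cosine similarity as the distance measure, the fixed threshold of 0.5, and the toy 3-dimensional vectors are all assumptions made for illustration. In practice the embeddings would come from a trained encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify_weak_labels(candidates, type_embeddings, threshold=0.5):
    """Keep a weak label only if the mention is close to its assigned type.

    candidates: list of (mention, entity_type, mention_embedding) tuples,
        e.g. produced by dictionary matching (hypothetical input format).
    type_embeddings: dict mapping entity type name to its embedding.
    threshold: assumed similarity cutoff; a real system would tune this.
    """
    kept = []
    for mention, entity_type, vector in candidates:
        similarity = cosine_similarity(vector, type_embeddings[entity_type])
        if similarity >= threshold:
            kept.append((mention, entity_type))
    return kept


# Toy embeddings, hand-written for illustration only.
type_embeddings = {
    "disease": [1.0, 0.0, 0.0],
    "location": [0.0, 1.0, 0.0],
}

# "Paris" is deliberately mislabeled as a disease to mimic a
# false-positive dictionary match.
candidates = [
    ("influenza", "disease", [0.9, 0.1, 0.0]),
    ("Paris", "disease", [0.1, 0.9, 0.0]),
    ("Seoul", "location", [0.0, 0.8, 0.2]),
]

verified = verify_weak_labels(candidates, type_embeddings)
print(verified)
```

In this toy setup the mislabeled "Paris" mention lies far from the "disease" type vector and is dropped, while the two correctly typed mentions are kept. A high-coverage dictionary produces more such spurious matches, which is why a filtering step of this kind becomes useful.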