Biomedical named entity recognition is one of the core tasks in biomedical natural language processing (BioNLP). Numerous supervised and distantly supervised approaches have been proposed to tackle this task. Despite their remarkable success, these approaches inevitably demand laborious human effort. To alleviate the need for human effort, dictionary-based approaches have been proposed that extract named entities based solely on a given dictionary. However, one downside of existing dictionary-based approaches is that they struggle to identify concept synonyms that are not listed in the given dictionary, which we refer to as the synonym generalization problem. In this study, we propose a novel Synonym Generalization (SynGen) framework that recognizes the biomedical concepts contained in the input text using span-based predictions. In particular, SynGen introduces two regularization terms, namely (1) a synonym distance regularizer and (2) a noise perturbation regularizer, to minimize the synonym generalization error. To demonstrate the effectiveness of our approach, we provide a theoretical analysis of the bound of the synonym generalization error. We extensively evaluate our approach on a wide range of benchmarks, and the results verify that SynGen outperforms previous dictionary-based models by notable margins. Lastly, we provide a detailed analysis to further reveal the merits and inner workings of our approach.
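
To make the synonym generalization problem concrete, the following minimal sketch shows a plain exact-match dictionary tagger over candidate spans. It is an illustration of the baseline behavior described above, not the SynGen model: the function names, the toy dictionary, and the concept identifier `C_MI` are all hypothetical and chosen only for this example.

```python
from typing import Dict, List, Tuple


def enumerate_spans(tokens: List[str], max_len: int = 3) -> List[Tuple[int, int]]:
    """Return every (start, end) token span of at most max_len tokens (end exclusive)."""
    return [
        (i, j)
        for i in range(len(tokens))
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1)
    ]


def dictionary_match(
    tokens: List[str], dictionary: Dict[str, str], max_len: int = 3
) -> List[Tuple[int, int, str]]:
    """Tag a span only if its lowercased surface form is literally a dictionary key."""
    matches = []
    for start, end in enumerate_spans(tokens, max_len):
        surface = " ".join(tokens[start:end]).lower()
        if surface in dictionary:
            matches.append((start, end, dictionary[surface]))
    return matches


# Hypothetical toy dictionary: two listed synonyms mapped to one made-up concept ID.
TOY_DICTIONARY = {
    "heart attack": "C_MI",
    "myocardial infarction": "C_MI",
}

# A listed synonym is found by exact matching.
listed = dictionary_match("patient had a heart attack".split(), TOY_DICTIONARY)

# An unlisted synonym of the same concept is missed entirely.
unlisted = dictionary_match("patient had a cardiac infarction".split(), TOY_DICTIONARY)
```

In this sketch, `listed` contains the span covering "heart attack", while `unlisted` is empty because "cardiac infarction" is not a dictionary key. A span-based model that scores candidate spans rather than matching them literally is one way to close that gap, which is the setting the abstract describes.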