Multi-level biomedical NER through multi-granularity embeddings and enhanced labeling

Biomedical Named Entity Recognition (NER) is a fundamental task in biomedical natural language processing that extracts relevant information from biomedical texts such as clinical records, scientific publications, and electronic health records. Conventional approaches to biomedical NER rely mainly on traditional machine learning techniques, such as Conditional Random Fields (CRF) and Support Vector Machines, or on deep learning models such as Recurrent Neural Networks and Convolutional Neural Networks (CNN). More recently, Transformer-based models, including BERT, have been applied to biomedical NER with remarkable results. However, these models typically rely on word-level embeddings, which limits their ability to capture character-level information. Such information is valuable in biomedical NER because biomedical text is highly variable and complex. To address this limitation, we propose a hybrid approach that combines the strengths of several models: a fine-tuned BERT provides contextualized word embeddings, a pre-trained multi-channel CNN captures character-level information, and a BiLSTM + CRF layer performs sequence labelling and models dependencies between the words in the text. We also propose an enhanced labelling method, applied during pre-processing, that improves identification of an entity's beginning word and thereby the recognition of multi-word entities, a common challenge in biomedical NER. By integrating these models with the pre-processing method, the proposed model captures both contextual information and fine-grained character-level information. We evaluated the model on the i2b2/2010 benchmark dataset, where it achieved an F1-score of 90.11. These results demonstrate the effectiveness of the proposed model for biomedical NER.
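
To illustrate why the entity's beginning word matters for multi-word entities, the following is a minimal sketch of the standard BIO tagging scheme, not the enhanced labelling method proposed in the paper, whose details the abstract does not give. The function names `spans_to_bio` and `bio_to_spans`, the example sentence, and the entity labels are all assumptions made for illustration. Under the strict decoding assumed here, an entity is only recovered if its first token carries a `B-` tag, so a single mistake on the beginning word loses the whole multi-word entity.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, entity type); hypothetical type alias
Span = Tuple[int, int, str]


def spans_to_bio(num_tokens: int, spans: List[Span]) -> List[str]:
    """Convert entity spans into one standard BIO tag per token."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # beginning word of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining words of the entity
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Strictly decode BIO tags: an entity must start with a B- tag."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        # Close the open entity when the current tag does not continue it.
        if start is not None and tag != f"I-{label}":
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # An I- tag with no open entity is ignored under strict decoding.
    return spans


# Illustrative sentence and labels (assumed, not taken from the dataset).
tokens = ["Patient", "denies", "chest", "pain", "after",
          "cardiac", "catheterization", "."]
gold_spans = [(2, 4, "problem"), (5, 7, "treatment")]
gold_tags = spans_to_bio(len(tokens), gold_spans)

# A prediction that misses only the beginning word of "chest pain".
pred_tags = list(gold_tags)
pred_tags[2] = "O"
pred_spans = bio_to_spans(pred_tags)
```

In this sketch, `bio_to_spans(gold_tags)` recovers both entities, while `pred_spans` contains only the second one: the orphaned `I-problem` tag on "pain" is discarded because no `B-problem` tag precedes it.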
