Biomedical named entity recognition (NER) is a critical task that aims to identify structured information in clinical text, which is often replete with complex technical terms and highly variable. Accurate and reliable NER facilitates the extraction and analysis of important biomedical information, which can in turn improve downstream applications, including healthcare systems. However, NER in the biomedical domain is challenging because annotated data is scarce: annotation requires considerable expertise, time, and expense. In this paper, working with limited data, we explore various extrinsic factors, including the corpus annotation scheme, data augmentation techniques, semi-supervised learning, and Brill transformation, to improve the performance of an NER model on a clinical text dataset (i2b2 2012, \citet{sun-rumshisky-uzuner:2013}). Our experiments demonstrate that these approaches can significantly improve the model's F1 score, from the original 73.74 to 77.55. Our findings suggest that considering different extrinsic factors and combining these techniques is a promising approach for improving NER performance in the biomedical domain, where data is limited.