Named entity recognition (NER) is a fundamental task in natural language processing. Biomedical method (BioMethod) NER is particularly challenging because scholarly literature continually introduces new domain-specific terminology. Research on BioMethod NER also suffers from a scarcity of resources, mainly because methodological concepts are intricate and delineating them precisely requires deep domain understanding. In this study, we propose a novel dataset for biomedical method entity recognition, built with an automated BioMethod entity recognition and information retrieval system that assists human annotation. We further evaluate a range of conventional and contemporary open-domain NER methods, including cutting-edge large language models (LLMs) customised to our dataset. Our empirical findings reveal that, surprisingly, the large parameter counts of language models inhibit effective learning of the entity extraction patterns of biomedical methods. Remarkably, an approach that combines the modestly sized ALBERT model (only 11MB) with conditional random fields (CRF) achieves state-of-the-art (SOTA) performance.