HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools

With the exponential growth of the life science literature, biomedical text mining (BTM) has become an essential technology for accelerating the extraction of insights from publications. Identifying named entities (e.g., diseases, drugs, or genes) in text and linking them to reference knowledge bases are crucial steps in BTM pipelines, because they allow information from different documents to be aggregated. However, tools for these two steps are rarely applied in the same context in which they were developed. Instead, they are applied in the wild, i.e., to application-dependent text collections that differ from the tools' training data in, e.g., focus, genre, style, and text type. This raises the question of whether the reported performance of BTM tools can be trusted for downstream applications. Here, we report the results of a carefully designed cross-corpus benchmark for named entity extraction, in which tools were systematically applied to corpora not used during their training. Based on a survey of 28 published systems, we selected five for an in-depth analysis on three publicly available corpora encompassing four entity types. The comparison yields a mixed picture and shows that performance in a cross-corpus setting is significantly lower than that reported in an in-corpus setting. HunFlair2 showed the best performance on average, closely followed by PubTator. Our results indicate that users of BTM tools should expect lower performance in the wild than reported in the original publications, and that further research is necessary to make BTM tools more robust.
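The cross-corpus setting described above can be illustrated with a minimal sketch. It assumes exact-match scoring of mentions (same document, character offsets, and entity type) and micro-averaged precision, recall, and F1; the abstract does not state the matching criteria actually used in the benchmark, and the tool and corpus names below are hypothetical placeholders.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    """One annotated entity mention (fields chosen for illustration only)."""
    doc_id: str
    start: int
    end: int
    entity_type: str


def micro_prf(gold: set[Mention], predicted: set[Mention]) -> tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 under exact matching (an assumption)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def cross_corpus_pairs(trained_on: dict[str, set[str]], corpora: list[str]) -> list[tuple[str, str]]:
    """Keep only (tool, corpus) pairs where the corpus was not used to train the tool."""
    return [
        (tool, corpus)
        for tool, seen in sorted(trained_on.items())
        for corpus in corpora
        if corpus not in seen
    ]


# Hypothetical tools and corpora, used only to show the filtering step.
trained_on = {"tool_a": {"corpus_x"}, "tool_b": {"corpus_y"}}
pairs = cross_corpus_pairs(trained_on, ["corpus_x", "corpus_y", "corpus_z"])

# Toy gold and predicted annotations for one evaluation pair.
gold = {Mention("d1", 0, 5, "Disease"), Mention("d1", 10, 14, "Gene")}
predicted = {Mention("d1", 0, 5, "Disease"), Mention("d1", 20, 24, "Gene")}
precision, recall, f1 = micro_prf(gold, predicted)
```

Filtering the evaluation pairs before scoring is what separates a cross-corpus score from an in-corpus one: a tool is never scored on a corpus it was trained on.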

