With the exponential growth of the life science literature, biomedical text mining (BTM) has become an essential technology for accelerating the extraction of insights from publications. Identifying named entities (e.g., diseases, drugs, or genes) in text and linking them to reference knowledge bases are crucial steps in BTM pipelines, because they enable information to be aggregated across documents. However, tools for these two steps are rarely applied in the same context in which they were developed. Instead, they are applied in the wild, i.e., on application-dependent text collections that differ from those used for training in, e.g., focus, genre, style, and text type. This raises the question of whether the reported performance of BTM tools can be trusted for downstream applications. Here, we report the results of a carefully designed cross-corpus benchmark for named entity extraction, in which tools were applied systematically to corpora not used during their training. Based on a survey of 28 published systems, we selected five for an in-depth analysis on three publicly available corpora encompassing four different entity types. The comparison between tools yields a mixed picture and shows that performance in a cross-corpus setting is significantly lower than that reported in an in-corpus setting. HunFlair2 showed the best performance on average, closely followed by PubTator. Our results indicate that users of BTM tools should expect lower performance in the wild than reported in the original publications, and that further research is necessary to make BTM tools more robust.
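To make the notion of cross-corpus performance concrete, the following minimal sketch scores a tool's predictions against a gold-standard corpus. It assumes strict matching, i.e., a prediction counts as correct only if document, character span, entity type, and knowledge-base identifier all agree. The abstract does not specify the matching criterion, so this is an assumption, and the `Mention` type, the `prf1` helper, and the toy mentions and identifiers are hypothetical illustrations rather than data from the benchmark.

```python
from typing import NamedTuple, Set, Tuple


class Mention(NamedTuple):
    """One linked entity mention (hypothetical schema for illustration)."""
    doc_id: str
    start: int        # character offset, inclusive
    end: int          # character offset, exclusive
    entity_type: str  # e.g. "disease", "gene"
    kb_id: str        # identifier in a reference knowledge base


def prf1(gold: Set[Mention], pred: Set[Mention]) -> Tuple[float, float, float]:
    """Strict-match precision, recall, and F1 over sets of mentions."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data, invented for this example only (placeholder identifiers).
gold = {
    Mention("doc1", 0, 6, "disease", "KB:0001"),
    Mention("doc1", 20, 27, "chemical", "KB:0002"),
    Mention("doc2", 5, 9, "gene", "KB:0003"),
}
pred = {
    Mention("doc1", 0, 6, "disease", "KB:0001"),     # exact match
    Mention("doc1", 20, 27, "chemical", "KB:0002"),  # exact match
    Mention("doc2", 5, 9, "gene", "KB:9999"),        # right span, wrong link
}

precision, recall, f1 = prf1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In a cross-corpus setting, `gold` would come from a corpus the tool never saw during training. Running the same scoring on the tool's own training corpus gives the in-corpus figure to compare against.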