Background: Medical research generates millions of publications, and making full use of this information is a great challenge for researchers because its scale and complexity far exceed human reading capabilities. Automated text mining can help extract and connect information spread across this large body of literature, but the technology is not easily accessible to life scientists.

Results: We developed an easy-to-use, end-to-end pipeline for deep learning- and dictionary-based named entity recognition (NER) of entities typically found in medical research articles, including diseases, cells, chemicals, genes/proteins, and species. The pipeline can access and process large collections of medical research articles (PubMed, CORD-19) or raw text, and it incorporates a series of deep learning models fine-tuned on the HUNER corpora collection. In addition, the pipeline can perform dictionary-based NER related to COVID-19 and other medical topics. Users can also load their own NER models and dictionaries to include additional entities. The output consists of publication-ready ranked lists and graphs of detected entities, together with files containing the annotated texts. An associated script allows rapid inspection of the results for specific entities of interest. As model use cases, the pipeline was deployed on two collections of autophagy-related abstracts from PubMed and on the CORD-19 dataset, a collection of 764,398 research article abstracts related to COVID-19.

Conclusions: The NER pipeline we present is applicable in a variety of medical research settings and makes customizable text mining accessible to life scientists.
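To make the dictionary-based part of the abstract concrete, the following is a minimal sketch of dictionary lookup NER followed by a ranked entity list. It is not the pipeline's own code: the toy `DICTIONARY`, the function names `build_pattern`, `tag_entities`, and `rank_entities`, and the case-insensitive whole-token matching are all assumptions made for illustration, and the deep learning models are not covered here.

```python
import re
from collections import Counter

# Hypothetical toy dictionary (entity type -> surface terms); not taken from the pipeline.
DICTIONARY = {
    "disease": ["COVID-19", "pneumonia"],
    "gene/protein": ["ACE2", "LC3"],
    "species": ["SARS-CoV-2"],
}


def build_pattern(terms):
    """Compile one case-insensitive pattern that matches any term as a whole token."""
    # Longest terms first so a longer term wins over a shorter one it starts with.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # Lookarounds instead of \b so terms containing hyphens are not matched
    # inside longer hyphenated tokens.
    return re.compile(rf"(?<![\w-])(?:{alternation})(?![\w-])", re.IGNORECASE)


def tag_entities(text, dictionary=DICTIONARY):
    """Return sorted (start, end, surface, label) tuples for every dictionary hit."""
    hits = []
    for label, terms in dictionary.items():
        for match in build_pattern(terms).finditer(text):
            hits.append((match.start(), match.end(), match.group(0), label))
    return sorted(hits)


def rank_entities(abstracts, dictionary=DICTIONARY):
    """Count hits across abstracts and return them ordered by descending frequency."""
    # Map lower-cased surface forms back to the dictionary spelling.
    canonical = {t.lower(): t for terms in dictionary.values() for t in terms}
    counts = Counter()
    for text in abstracts:
        for _, _, surface, label in tag_entities(text, dictionary):
            counts[(canonical[surface.lower()], label)] += 1
    return counts.most_common()
```

Calling `rank_entities` on a list of abstract strings returns `((term, label), count)` pairs, which is the kind of ranked entity list the abstract describes. A real dictionary would also need synonym handling and disambiguation, which this sketch leaves out.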