Published biomedical information has grown rapidly and continues to do so. Recent advances in Natural Language Processing (NLP) have generated considerable interest in automating the extraction, normalization, and representation of biomedical knowledge about entities such as genes and diseases. Our study analyzes germline abstracts to construct a knowledge graph from the immense body of work that has been done in this area on genes and diseases. This paper presents SimpleGermKG, an automatic knowledge graph construction approach that connects germline genes and diseases. To extract genes and diseases, we employ BioBERT, a BERT model pre-trained on biomedical corpora. We propose an ontology-based and rule-based algorithm to standardize and disambiguate medical terms. To capture semantic relationships between articles, genes, and diseases, we implement a part-whole relation approach that connects each entity with its data source, and we visualize the result in a graph-based knowledge representation. Lastly, we discuss the knowledge graph's applications, limitations, and challenges to inspire future research on germline corpora. Our knowledge graph contains 297 genes, 130 diseases, and 46,747 triples. Graph-based visualizations are used to show the results.
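As an illustration of the part-whole idea described in the abstract, the following is a minimal sketch in Python, not the SimpleGermKG implementation. It assumes the gene and disease mentions have already been extracted (for example by a BioBERT-based tagger, which is not shown). The `SYNONYMS` table, the `normalize` and `build_triples` helpers, the `part_of` predicate name, and the article identifier are all hypothetical stand-ins chosen for the example; a real system would look terms up in an ontology rather than a hand-written dictionary.

```python
from collections import namedtuple

# A knowledge-graph edge: (subject, predicate, object).
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical synonym table standing in for an ontology lookup.
# Keys are lowercased, whitespace-collapsed surface forms.
SYNONYMS = {
    "brca-1": "BRCA1",
    "brca1": "BRCA1",
    "breast carcinoma": "breast cancer",
    "breast cancer": "breast cancer",
}


def normalize(mention):
    """Map a raw mention to a canonical term using simple rules.

    Rule 1: lowercase and collapse whitespace before lookup.
    Rule 2: fall back to the stripped mention if no synonym is known.
    """
    key = " ".join(mention.lower().split())
    return SYNONYMS.get(key, mention.strip())


def build_triples(article_id, genes, diseases):
    """Link every normalized entity to its source article (part-whole)."""
    triples = set()  # a set removes duplicates created by normalization
    for mention in list(genes) + list(diseases):
        triples.add(Triple(normalize(mention), "part_of", article_id))
    return sorted(triples)


if __name__ == "__main__":
    # "article-1" is a made-up identifier used only for this example.
    for t in build_triples("article-1", ["BRCA-1", "brca1"], ["Breast  carcinoma"]):
        print(t)
```

Because both gene mentions normalize to the same canonical term, the example yields two triples rather than three, which is the kind of disambiguation the normalization step is meant to provide before entities are attached to their source article.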