A Biomedical Knowledge Graph for Biomarker Discovery in Cancer

Structured and unstructured data and facts about drugs, genes, proteins, viruses, and their mechanisms are spread across a huge number of scientific articles. These articles are a large-scale knowledge source and can have a major impact on disseminating knowledge about the mechanisms of certain biological processes. A domain-specific knowledge graph (KG) is an explicit conceptualization of a specific subject-matter domain, represented in terms of semantically interrelated entities and relations. A KG can be constructed by integrating such facts and data, and can then be used for data integration, exploration, and federated queries. However, exploring and querying large-scale KGs is tedious for certain groups of users because they lack knowledge about the underlying data assets or semantic technologies. A well-constructed KG not only supports deducing new knowledge and question answering (QA) but also allows domain experts to explore the data. Since cross-disciplinary explanations are important for accurate diagnosis, it is important to be able to query the KG to provide interactive explanations about learned biomarkers. Motivated by these needs, we construct a domain-specific KG for cancer-specific biomarker discovery. The KG is constructed by integrating cancer-related knowledge and facts from multiple sources. First, we construct a domain-specific ontology, which we call the OncoNet Ontology (ONO). ONO is developed to enable semantic reasoning for verifying predicted relations between diseases and genes. The KG is then developed and enriched by harmonizing ONO with additional metadata schemas, ontologies, controlled vocabularies, and additional concepts extracted from external sources using a BERT-based information extraction method. BioBERT and SciBERT are fine-tuned on selected articles crawled from PubMed. We list some queries and give examples of QA and knowledge deduction based on the KG.
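
To make the idea of querying a KG and deducing new knowledge from it more concrete, the following is a minimal sketch in plain Python. It is not the ONO ontology or the authors' pipeline: the predicate names (`associated_with`, `targets`, `subclass_of`), the helper functions, and the handful of triples are assumptions chosen for illustration, and a real system would more likely use an RDF store with SPARQL and an OWL reasoner.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Illustrative triples only; predicate names are hypothetical, not ONO terms.
TRIPLES: Set[Triple] = {
    ("BRCA1", "associated_with", "breast_carcinoma"),
    ("ERBB2", "associated_with", "breast_carcinoma"),
    ("trastuzumab", "targets", "ERBB2"),
    ("breast_carcinoma", "subclass_of", "carcinoma"),
    ("carcinoma", "subclass_of", "cancer"),
}


def match(triples: Set[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def infer_associations(triples: Set[Triple]) -> Set[Triple]:
    """Forward-chain one rule until no new triples appear:
    (g associated_with d) and (d subclass_of D) => (g associated_with D).
    """
    closed = set(triples)
    while True:
        new = {
            (g, "associated_with", sup)
            for (g, p1, d) in closed if p1 == "associated_with"
            for (d2, p2, sup) in closed if p2 == "subclass_of" and d2 == d
        }
        if new <= closed:  # fixpoint reached
            return closed
        closed |= new


def drugs_for(triples: Set[Triple], disease: str) -> List[str]:
    """Drugs that target any gene associated with the given disease."""
    genes = {s for (s, _, _) in match(triples, p="associated_with", o=disease)}
    return sorted(s for (s, _, o) in match(triples, p="targets") if o in genes)


if __name__ == "__main__":
    kg = infer_associations(TRIPLES)
    print(match(kg, p="associated_with", o="cancer"))
    print(drugs_for(kg, "cancer"))
```

In this sketch, the association between a gene and the general class `cancer` is never stated directly; it is deduced from the subclass hierarchy, which is the kind of reasoning an ontology is meant to support. The `drugs_for` helper then answers a simple QA-style question by joining two triple patterns.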
