From Large Language Models to Knowledge Graphs for Biomarker Discovery in Cancer

Domain experts often rely on up-to-date knowledge to understand and communicate the biological processes that inform prevention strategies and therapeutic decision-making. A challenging scenario for artificial intelligence (AI) is using biomedical data (e.g., text, imaging, omics, and clinical data) to provide diagnosis and treatment recommendations for cancerous conditions. Data and knowledge about cancers, drugs, genes, proteins, and their mechanisms are spread across structured sources (knowledge bases (KBs)) and unstructured sources (e.g., scientific articles). A large-scale knowledge graph (KG) can be constructed by integrating these data and then extracting facts about semantically interrelated entities and relations. Such KGs not only support exploration and question answering (QA) but also allow domain experts to deduce new knowledge. However, exploring and querying large-scale KGs is tedious for non-domain users, who often lack an understanding of the underlying data assets and semantic technologies. In this paper, we develop a domain KG to support cancer-specific biomarker discovery and interactive QA. To this end, we develop a domain ontology, the OncoNet Ontology (ONO), which enables semantic reasoning for validating gene-disease relations. The KG is then enriched by harmonizing the ONO, controlled vocabularies, and additional biomedical concepts extracted from scientific articles with BioBERT- and SciBERT-based information extraction (IE) methods. Further, because the biomedical domain evolves and new findings often replace old ones, an AI system that does not incorporate up-to-date findings is likely to exhibit concept drift when providing diagnosis and treatment recommendations. Therefore, we fine-tuned the KG using large language models (LLMs), based on more recent articles and KBs that the named entity recognition models might not have seen.

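To make the idea of validating gene-disease relations before they enter the KG concrete, the following is a minimal sketch, not the ONO or the pipeline used in this work. It assumes a toy type vocabulary and a simple domain/range check per relation; the names `ENTITY_TYPES`, `RELATION_SCHEMA`, `is_valid`, and `KnowledgeGraph`, as well as the relation labels and the handful of example triples, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical entity typing; the actual ONO classes are not specified here.
ENTITY_TYPES = {
    "TP53": "Gene",
    "BRCA1": "Gene",
    "breast cancer": "Disease",
    "olaparib": "Drug",
}

# Hypothetical domain/range constraints: relation -> (subject type, object type).
RELATION_SCHEMA = {
    "associated_with": ("Gene", "Disease"),
    "treats": ("Drug", "Disease"),
}


def is_valid(triple):
    """Return True if the triple satisfies its relation's domain/range constraint."""
    subj, rel, obj = triple
    expected = RELATION_SCHEMA.get(rel)
    if expected is None:
        return False  # unknown relation
    return (ENTITY_TYPES.get(subj), ENTITY_TYPES.get(obj)) == expected


class KnowledgeGraph:
    """A toy triple store that only accepts schema-valid facts."""

    def __init__(self):
        # Index facts by (relation, object) so subjects can be looked up directly.
        self._by_object = defaultdict(set)

    def add(self, triple):
        """Insert a triple; return False (and store nothing) if it is invalid."""
        if not is_valid(triple):
            return False
        subj, rel, obj = triple
        self._by_object[(rel, obj)].add(subj)
        return True

    def subjects(self, rel, obj):
        """Answer 'which entities stand in relation rel to obj?' in sorted order."""
        return sorted(self._by_object.get((rel, obj), set()))


kg = KnowledgeGraph()
kg.add(("BRCA1", "associated_with", "breast cancer"))
kg.add(("TP53", "associated_with", "breast cancer"))
# Rejected: a Drug cannot be the subject of the gene-disease relation.
kg.add(("olaparib", "associated_with", "breast cancer"))

print(kg.subjects("associated_with", "breast cancer"))  # ['BRCA1', 'TP53']
```

A full ontology reasoner would go well beyond this type check (for example, class hierarchies and inferred relations), but the sketch shows the basic pattern: candidate facts extracted from text are filtered against schema constraints before they can be queried.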