Representation Learning for Person or Entity-centric Knowledge Graphs: An Application in Healthcare

Knowledge graphs (KGs) are a popular way to organise information based on ontologies or schemas and have been used across a variety of scenarios, from search to recommendation. Despite advances in KGs, representing knowledge remains a non-trivial task across industries, and it is especially challenging in the biomedical and healthcare domains due to complex interdependent relations between entities, heterogeneity, lack of standardisation, and sparseness of data. KGs are used to discover diagnoses or prioritise genes relevant to disease, but they often rely on schemas that are not centred around a node or entity of interest, such as a person. Entity-centric KGs are relatively unexplored but hold promise for representing important facets connected to a central node and for unlocking downstream tasks beyond graph traversal and reasoning, such as generating graph embeddings and training graph neural networks for a wide range of predictive tasks. This paper presents an end-to-end representation learning framework to extract entity-centric KGs from structured and unstructured data. We introduce a star-shaped ontology to represent the multiple facets of a person and use it to guide KG creation. Compact representations of the graphs are created using graph neural networks, and experiments are conducted using different levels of heterogeneity or explicitness. A readmission prediction task is used to evaluate the proposed framework, showing a stable system, robust to missing data, that outperforms a range of baseline machine learning classifiers. We highlight that this approach has several potential applications across domains and is open-sourced. Lastly, we discuss lessons learned, challenges, and next steps for the adoption of the framework in practice.
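To make the idea of a star-shaped, person-centric graph concrete, the following is a minimal sketch in plain Python. It is not the framework's implementation: the facet names (`Age`, `Diagnosis`, `Intervention`), the `has<Facet>` predicate naming, and the helper functions are all hypothetical and chosen only for illustration. It shows a single central person node linked directly to each facet value, which is the structural property the abstract describes.

```python
from typing import Dict, List, Tuple

# A triple is (subject, predicate, object), as in a simple KG edge list.
Triple = Tuple[str, str, str]


def build_person_graph(person_id: str, facets: Dict[str, List[str]]) -> List[Triple]:
    """Link one central person node to every facet value (hypothetical schema)."""
    triples: List[Triple] = []
    for facet, values in facets.items():
        for value in values:
            # Predicate and node naming are illustrative assumptions.
            triples.append((person_id, f"has{facet}", f"{facet}:{value}"))
    return triples


def is_star_shaped(triples: List[Triple], center: str) -> bool:
    """True if every edge starts at the central node."""
    return all(subject == center for subject, _, _ in triples)


# Hypothetical record; a facet with no values simply contributes no edges.
record = {
    "Age": ["60-70"],
    "Diagnosis": ["heart_failure", "diabetes"],
    "Intervention": [],
}
triples = build_person_graph("person:1", record)
star = is_star_shaped(triples, "person:1")
```

In this sketch, a missing facet leaves the rest of the graph unchanged, so each person graph keeps the same shape regardless of which facets are populated. A graph neural network would consume a numeric encoding of such triples rather than the raw strings shown here.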
