Knowledge graphs (KGs) are an important tool for representing complex relationships between entities in the biomedical domain. Several methods have been proposed for learning embeddings that can be used to predict new links in such graphs. Some methods ignore valuable attribute data associated with entities in biomedical KGs, such as protein sequences or molecular graphs. Other works incorporate such data, but assume that all entities can be represented with the same data modality. This is not always the case for biomedical KGs, where entities exhibit heterogeneous modalities that are central to their representation in the subject domain. We propose a modular framework for learning embeddings in KGs with entity attributes that allows encoding attribute data of different modalities while also supporting entities with missing attributes. We additionally propose an efficient pretraining strategy for reducing the required training runtime. We train models using a biomedical KG containing approximately 2 million triples, and evaluate the performance of the resulting entity embeddings on the tasks of link prediction and drug-protein interaction prediction, comparing against methods that do not take attribute data into account. In the standard link prediction evaluation, the proposed method achieves competitive, yet lower, performance than baselines that do not use attribute data. When evaluated on the task of drug-protein interaction prediction, the method compares favorably with the baselines. In settings involving low-degree entities, which make up a substantial fraction of the entities in the KG, our method outperforms the baselines. Our proposed pretraining strategy yields significantly higher performance while reducing the required training runtime. Our implementation is available at https://github.com/elsevier-AI-Lab/BioBLP.
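The modular idea described in the abstract (one encoder per attribute modality, with support for entities that have no attribute data) can be illustrated with a minimal sketch. This is not the BioBLP implementation: the toy encoders, the `EntityEmbedder` class, the embedding size `DIM`, and the use of a TransE-style scoring function are all assumptions made for illustration only.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

DIM = 8  # hypothetical embedding size, chosen only for this sketch


def encode_protein(seq: str) -> List[float]:
    """Toy protein encoder (assumption): residue counts hashed into DIM buckets."""
    vec = [0.0] * DIM
    for ch in seq:
        vec[ord(ch) % DIM] += 1.0
    n = max(len(seq), 1)
    return [v / n for v in vec]


def encode_molecule(edges: List[Tuple[int, int]]) -> List[float]:
    """Toy molecular-graph encoder (assumption): normalized histogram of atom degrees."""
    degree: Dict[int, int] = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    vec = [0.0] * DIM
    for d in degree.values():
        vec[min(d, DIM - 1)] += 1.0
    n = max(len(degree), 1)
    return [v / n for v in vec]


class EntityEmbedder:
    """Routes each entity to the encoder for its modality.

    Entities without attribute data fall back to a per-entity lookup
    vector, which would be a trainable embedding in a real model.
    """

    def __init__(self, encoders: Dict[str, Callable], seed: int = 0):
        self.encoders = encoders
        self.rng = random.Random(seed)
        self.lookup: Dict[str, List[float]] = {}

    def embed(self, entity_id: str, modality: Optional[str] = None,
              attribute=None) -> List[float]:
        # Use the modality-specific encoder when attribute data is available.
        if modality in self.encoders and attribute is not None:
            return self.encoders[modality](attribute)
        # Otherwise fall back to a lookup embedding, created on first use.
        if entity_id not in self.lookup:
            self.lookup[entity_id] = [self.rng.uniform(-1.0, 1.0) for _ in range(DIM)]
        return self.lookup[entity_id]


def transe_score(h: List[float], r: List[float], t: List[float]) -> float:
    """TransE-style score (assumed here): negative L1 distance of h + r from t."""
    return -sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))


embedder = EntityEmbedder({"protein": encode_protein, "molecule": encode_molecule})
protein_vec = embedder.embed("P1", "protein", "MKTAYIAK")
drug_vec = embedder.embed("D1", "molecule", [(0, 1), (1, 2), (2, 0)])
disease_vec = embedder.embed("DIS1")  # no attributes: lookup fallback
relation_vec = [0.0] * DIM  # placeholder relation embedding
score = transe_score(drug_vec, relation_vec, protein_vec)
```

Because every encoder maps into the same `DIM`-dimensional space, the scoring function does not need to know which modality produced an embedding, which is what lets encoders be added or swapped independently in this sketch.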