Biomedical knowledge is fragmented across siloed databases: Reactome for pathways, STRING for protein interactions, ClinicalTrials.gov for study registries, DrugBank for drug vocabularies, DGIdb for drug-gene interactions, and SIDER for side effects. We present three open-source biomedical knowledge graphs (KGs) built on Samyama, a high-performance graph database written in Rust: Pathways KG (118,686 nodes, 834,785 edges from 5 sources), Clinical Trials KG (7,774,446 nodes, 26,973,997 edges from 5 sources), and Drug Interactions KG (32,726 nodes, 191,970 edges from 3 sources). Our contributions are threefold. First, we describe a reproducible ETL pattern for constructing large-scale KGs from heterogeneous public data sources, with cross-source deduplication, batch loading (Python Cypher and Rust native loaders), and portable snapshot export. Second, we demonstrate cross-KG federation: loading all three snapshots into a single graph tenant enables property-based joins across datasets. Third, we introduce schema-driven MCP server generation for LLM agent access and evaluate it on BiomedQA, a new benchmark of 40 pharmacology questions: domain-specific MCP tools achieve 98% accuracy, versus 0% for text-to-Cypher and 75% for standalone GPT-4o. All data sources are open-license. The combined federated graph (7.9M nodes, 28M edges) loads in approximately 3 minutes on commodity cloud hardware, and cross-KG queries complete in 80 ms to 4 s.
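To make the idea of a property-based join across datasets concrete, the following is a minimal in-memory sketch in Python. It does not use Samyama or its query interface. The node labels (`Drug`, `Intervention`), the `name` join property, the normalization rule, and the sample records are all assumptions chosen for illustration, not the schema of the published KGs.

```python
from collections import defaultdict

# Hypothetical node records standing in for two KG snapshots loaded into
# one store. Labels, property names, and values are illustrative only.
drug_kg_nodes = [
    {"kg": "drug_interactions", "label": "Drug", "name": "Metformin"},
    {"kg": "drug_interactions", "label": "Drug", "name": "Warfarin"},
]
trials_kg_nodes = [
    {"kg": "clinical_trials", "label": "Intervention", "name": "metformin "},
    {"kg": "clinical_trials", "label": "Intervention", "name": "Aspirin"},
]


def normalize(value: str) -> str:
    """Assumed normalization: trim whitespace and lowercase the join key."""
    return value.strip().lower()


def property_join(left, right, key="name"):
    """Pair nodes from two datasets whose `key` property matches after
    normalization. No cross-KG edge is required, only a shared value."""
    index = defaultdict(list)
    for node in right:
        index[normalize(node[key])].append(node)
    return [
        (l_node, r_node)
        for l_node in left
        for r_node in index.get(normalize(l_node[key]), [])
    ]


matches = property_join(drug_kg_nodes, trials_kg_nodes)
for drug, intervention in matches:
    print(drug["kg"], drug["name"], "<->", intervention["kg"], intervention["name"].strip())
```

In a graph database the same pattern would typically be written as a query that matches two node patterns and filters on equality of a shared property. The hash index built here plays the role that a property index would play in such a query.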