Open Biomedical Knowledge Graphs at Scale: Construction, Federation, and AI Agent Access with Samyama Graph Database

Biomedical knowledge is fragmented across siloed databases -- Reactome for pathways, STRING for protein interactions, Gene Ontology for functional annotations, ClinicalTrials.gov for study registries, and dozens more. Researchers routinely download flat files from each source and write bespoke scripts to cross-reference them, a process that is slow, error-prone, and not reproducible. We present two open-source biomedical knowledge graphs -- Pathways KG (118,686 nodes, 834,785 edges from 5 sources) and Clinical Trials KG (7,774,446 nodes, 26,973,997 edges from 5 sources) -- built on Samyama, a high-performance graph database written in Rust. Our contributions are threefold. First, we describe a reproducible ETL pattern for constructing large-scale KGs from heterogeneous public data sources, with cross-source deduplication, batch Cypher loading, and portable snapshot export. Second, we demonstrate cross-KG federation: loading both snapshots into a single graph tenant enables property-based joins across datasets, answering questions like "Which biological pathways are disrupted by drugs currently in Phase 3 trials for breast cancer?" -- a query that neither KG can answer alone. Third, we introduce schema-driven MCP server generation: each KG automatically exposes typed tools for LLM agents via the Model Context Protocol, enabling natural-language access to graph queries without manual tool authoring. All data sources are open-license (CC BY 4.0, CC0, OBO). Snapshots, ETL code, and MCP configurations are publicly available. The combined federated graph (7.89M nodes, 27.8M edges) loads in 76 seconds on commodity hardware (Mac Mini M4, 16 GB RAM), and the signature cross-KG query -- "Which pathways are disrupted by drugs in Phase 3 breast cancer trials?" -- returns validated results in 2.1 seconds.
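To make the idea of a property-based join across two graphs concrete, the following is a minimal Python sketch on toy in-memory data. It is not the Samyama API or the authors' query. Every label, property name, and record in it (`trial_edges`, `drug_targets`, `protein_pathways`, the drug name used as join key) is a hypothetical placeholder, and the actual join property used by the two KGs may differ.

```python
from collections import defaultdict

# Hypothetical fragment of a clinical-trials graph: trial -> drug.
trial_edges = [
    {"trial": "TRIAL-A", "phase": 3, "condition": "breast cancer", "drug": "drug-x"},
    {"trial": "TRIAL-B", "phase": 2, "condition": "breast cancer", "drug": "drug-y"},
    {"trial": "TRIAL-C", "phase": 3, "condition": "asthma", "drug": "drug-z"},
]

# Hypothetical fragment of a pathways graph: drug -> protein -> pathway.
drug_targets = [("Drug-X", "PROT1"), ("drug-y", "PROT2"), ("drug-z", "PROT3")]
protein_pathways = [
    ("PROT1", "pathway-alpha"),
    ("PROT1", "pathway-beta"),
    ("PROT3", "pathway-gamma"),
]


def normalize(name):
    """Normalize the join key so both graphs agree on spelling and case."""
    return name.strip().lower()


def pathways_for_trials(phase, condition):
    """Return pathways reachable from trials matching phase and condition."""
    # Index the pathways-side edges by the shared property (assumed: drug name).
    targets_by_drug = defaultdict(set)
    for drug, protein in drug_targets:
        targets_by_drug[normalize(drug)].add(protein)

    pathways_by_protein = defaultdict(set)
    for protein, pathway in protein_pathways:
        pathways_by_protein[protein].add(pathway)

    # Filter on the trials side, then hop into the pathways side via the key.
    result = set()
    for edge in trial_edges:
        if edge["phase"] == phase and edge["condition"] == condition:
            for protein in targets_by_drug.get(normalize(edge["drug"]), ()):
                result |= pathways_by_protein.get(protein, set())
    return sorted(result)


print(pathways_for_trials(3, "breast cancer"))
```

Neither toy dataset can answer the question alone: the trial records carry no pathway information, and the pathway records carry no trial phase. The join succeeds only because both sides share a normalized property value, which is the role cross-source deduplication plays in the federated setting.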
