MetaboKG: An Analysis-centric Knowledge Graph Framework for Untargeted Metabolomics

Untargeted metabolomics generates large volumes of tandem mass spectrometry (MS/MS) data and computational annotations that can reveal molecular mechanisms across organisms and environments. Public reuse has improved through harmonized repository metadata and access infrastructures such as Pan-ReDU, and through metabolomics knowledge graphs such as ENPKG and METRIN-KG. Yet the analytical layer remains fragmented: spectra, features, workflow outputs, annotations, confidence evidence, and contextual metadata are still scattered across repositories and tabular artifacts. We present MetaboKG, an analysis-centric knowledge graph framework for engineering reusable metabolomics knowledge from public repositories, metadata, and GNPS molecular networking results. MetaboKG makes three contributions: (i) a transformation workflow that preserves links between repository exports, analytical files, spectra, features, and annotation results; (ii) a semantic model grounded in PROV-O and SIO and aligned with the Mass Spectrometry ontology (MS), ChEBI, NCBITaxon, ENVO, and NCIT to represent provenance, analytical evidence, metadata attributes, and controlled vocabulary terms; and (iii) a Universal Annotation Identifier strategy that extends the Universal Spectrum Identifier (USI) with workflow-specific components to support late binding, incremental ingestion, and post hoc linkage across analyses. We demonstrate MetaboKG at public-repository scale on 680 GNPS molecular networking results and evaluate it through competency questions covering biochemical enrichment, environmental specificity, and cross-instrument analytical variation. The results show that graph-based integration supports traceable annotation reuse and reproducible SPARQL exploration of biochemical relationships that remain fragmented across repository-native resources.
