Bayesian Functional Analysis for Untargeted Metabolomics Data with Matching Uncertainty and Small Sample Sizes

Untargeted metabolomics based on liquid chromatography-mass spectrometry technology is quickly gaining widespread application, given its ability to depict the global metabolic pattern in biological samples. However, the data are noisy, and the features measured from samples lack clear identities. Multiple potential matches exist between data features and known metabolites, whereas the true matches can only be one-to-one. Some existing methods attempt to reduce this matching uncertainty, but they are far from able to remove it for most features. The uncertainty causes major difficulty in downstream functional analysis. To address these issues, we develop a novel approach, Bayesian Analysis of Untargeted Metabolomics data (BAUM), that integrates previously separate tasks into a single framework: matching uncertainty inference, metabolite selection, and functional analysis. By incorporating the knowledge graph between variables and using relatively simple assumptions, BAUM can analyze datasets with small sample sizes. By allowing different confidence levels of feature-metabolite matching, the method is applicable to datasets in which feature identities are partially known. Simulation studies demonstrate that, compared with other existing methods, BAUM achieves better accuracy both in selecting important metabolites that tend to be functionally consistent and in assigning confidence scores to feature-metabolite matches. We analyze a COVID-19 metabolomics dataset and a mouse brain metabolomics dataset using BAUM. Even with a very small sample size of 16 mice per group, BAUM is robust and stable. It finds pathways that conform to existing knowledge, as well as novel pathways that are biologically plausible.
