Bayesian Functional Analysis for Untargeted Metabolomics Data with Matching Uncertainty and Small Sample Sizes

Untargeted metabolomics based on liquid chromatography-mass spectrometry (LC-MS) is rapidly gaining widespread use because it can depict the global metabolic pattern in biological samples. However, the data are noisy, and the features measured from samples lack clear identities. Multiple potential matches exist between data features and known metabolites, whereas the true matches can only be one-to-one. Some existing methods attempt to reduce this matching uncertainty, but they are far from able to remove it for most features. The remaining uncertainty causes major difficulty in downstream functional analysis. To address these issues, we develop a novel approach, Bayesian Analysis of Untargeted Metabolomics data (BAUM), which integrates previously separate tasks into a single framework: matching uncertainty inference, metabolite selection, and functional analysis. By incorporating the knowledge graph between variables and using relatively simple assumptions, BAUM can analyze datasets with small sample sizes. By allowing different confidence levels for feature-metabolite matches, the method also applies to datasets in which feature identities are partially known. Simulation studies demonstrate that, compared with existing methods, BAUM achieves better accuracy both in selecting important metabolites that tend to be functionally consistent and in assigning confidence scores to feature-metabolite matches. We analyze a COVID-19 metabolomics dataset and a mouse brain metabolomics dataset using BAUM. Even with a very small sample size of 16 mice per group, BAUM is robust and stable. It finds pathways that conform to existing knowledge, as well as novel pathways that are biologically plausible.
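
To make the matching uncertainty described above concrete, the following minimal sketch shows how a single LC-MS feature can have several candidate metabolites when matching is done by mass alone. It is not BAUM's algorithm: the function `candidate_matches`, the 10 ppm tolerance, the restriction to the [M+H]+ adduct, and the feature m/z values are all illustrative assumptions.

```python
from collections import defaultdict

# Monoisotopic masses in Da for a tiny illustrative library.
# Leucine and isoleucine are isomers, so mass alone cannot separate them.
METABOLITES = {
    "leucine": 131.09463,
    "isoleucine": 131.09463,
    "creatine": 131.06948,
    "glucose": 180.06339,
}

PROTON_MASS = 1.007276  # mass shift of the [M+H]+ adduct (assumed ion form)


def candidate_matches(feature_mz, metabolites, ppm=10.0):
    """Map each feature ID to the metabolites within `ppm` of its m/z.

    This is a plain tolerance search (an assumption for illustration),
    not the Bayesian matching inference described in the abstract.
    """
    matches = defaultdict(list)
    for feature_id, mz in feature_mz.items():
        for name, mass in metabolites.items():
            expected = mass + PROTON_MASS
            if abs(mz - expected) / expected * 1e6 <= ppm:
                matches[feature_id].append(name)
    return dict(matches)


# Hypothetical observed feature m/z values.
features = {"f1": 132.1019, "f2": 181.0707, "f3": 132.0768}
candidates = candidate_matches(features, METABOLITES)

for feature_id, names in candidates.items():
    status = "ambiguous" if len(names) > 1 else "unique"
    print(f"{feature_id}: {names} ({status})")
```

In this toy setting, feature `f1` is compatible with both leucine and isoleucine, although at most one of them can be the true identity. Resolving such ambiguity probabilistically, rather than by an arbitrary choice, is the kind of problem the matching uncertainty inference targets.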
