Bayesian Functional Analysis for Untargeted Metabolomics Data with Matching Uncertainty and Small Sample Sizes

Untargeted metabolomics based on liquid chromatography-mass spectrometry (LC-MS) is rapidly gaining widespread use because it can depict the global metabolic pattern in biological samples. However, the data are noisy, and the features measured from samples lack clear identities. Multiple potential matches exist between data features and known metabolites, whereas the true matches can only be one-to-one. Some existing methods attempt to reduce this matching uncertainty, but they are far from able to remove it for most features. The remaining uncertainty causes major difficulty in downstream functional analysis. To address these issues, we develop a novel approach, Bayesian Analysis of Untargeted Metabolomics data (BAUM), which integrates previously separate tasks into a single framework: matching uncertainty inference, metabolite selection, and functional analysis. By incorporating the knowledge graph between variables and using relatively simple assumptions, BAUM can analyze datasets with small sample sizes. By allowing different confidence levels for feature-metabolite matches, the method is also applicable to datasets in which feature identities are partially known. Simulation studies demonstrate that, compared with existing methods, BAUM achieves better accuracy in selecting important metabolites that tend to be functionally consistent and in assigning confidence scores to feature-metabolite matches. We analyze a COVID-19 metabolomics dataset and a mouse brain metabolomics dataset using BAUM. Even with a very small sample size of 16 mice per group, BAUM is robust and stable. It identifies pathways that conform to existing knowledge, as well as novel pathways that are biologically plausible.
