Untargeted metabolomic profiling through liquid chromatography-mass spectrometry (LC-MS) measures a vast array of metabolites within biospecimens, advancing drug development, disease diagnosis, and risk prediction. However, the low throughput of LC-MS poses a major challenge for biomarker discovery, annotation, and experimental comparison, necessitating the merging of multiple datasets. Current data pooling methods have practical limitations because they are vulnerable to variations between datasets and depend on hyperparameter choices. Here we introduce GromovMatcher, a flexible and user-friendly algorithm that automatically combines LC-MS datasets using optimal transport. By capitalizing on feature intensity correlation structures, GromovMatcher delivers superior alignment accuracy and robustness compared to existing approaches. The algorithm scales to thousands of features and requires minimal hyperparameter tuning. Applying our method to experimental patient studies of liver and pancreatic cancer, we discover shared metabolic features related to patient alcohol intake, demonstrating how GromovMatcher facilitates the search for biomarkers associated with lifestyle risk factors linked to several cancer types.
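To make the idea of matching features through their intensity correlation structure concrete, here is a minimal toy sketch. It is not the GromovMatcher implementation: it replaces the optimal transport solver with a brute-force search over one-to-one matchings, which is only feasible for a handful of features, and it scores each matching with a squared-difference objective in the spirit of Gromov-Wasserstein. All names (`pearson`, `correlation_matrix`, `structure_mismatch`, `best_matching`) and the data are hypothetical. For simplicity, the second "study" reuses the same samples with shuffled, rescaled, and shifted features, whereas real studies measure different participants and their correlations agree only approximately.

```python
import itertools
import math


def pearson(x, y):
    """Pearson correlation between two equal-length intensity vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_matrix(features):
    """Feature-by-feature correlation matrix; each feature is a list of per-sample intensities."""
    k = len(features)
    return [[pearson(features[i], features[j]) for j in range(k)] for i in range(k)]


def structure_mismatch(c1, c2, perm):
    """Sum of squared differences between c1[i][j] and c2[perm[i]][perm[j]].

    A small value means the matching i -> perm[i] preserves the correlation structure.
    """
    k = len(perm)
    return sum(
        (c1[i][j] - c2[perm[i]][perm[j]]) ** 2 for i in range(k) for j in range(k)
    )


def best_matching(c1, c2):
    """Brute-force stand-in for an optimal transport solver (assumption: tiny, equal-sized inputs)."""
    k = len(c1)
    return min(
        itertools.permutations(range(k)),
        key=lambda perm: structure_mismatch(c1, c2, perm),
    )


# Hypothetical study 1: four features measured on six samples.
study1 = [
    [1, 2, 3, 4, 5, 6],
    [2, 1, 4, 3, 6, 5],
    [1, 3, 2, 6, 4, 5],
    [3, 6, 1, 5, 2, 4],
]

# Hypothetical study 2: the same features in a different order, each with its own
# scale and offset, so raw intensities are not directly comparable across studies.
shuffle = [2, 0, 3, 1]  # study2[j] is derived from study1[shuffle[j]]
scales = [10.0, 0.5, 3.0, 2.0]
offsets = [100.0, 0.0, -5.0, 7.0]
study2 = [
    [scales[j] * value + offsets[j] for value in study1[shuffle[j]]]
    for j in range(4)
]

c1 = correlation_matrix(study1)
c2 = correlation_matrix(study2)
matching = best_matching(c1, c2)
print(matching)  # matching[i] is the study-2 index paired with study-1 feature i
```

In this toy example the search recovers `(1, 3, 0, 2)`, the inverse of `shuffle`, even though the raw intensities differ between the two studies, because Pearson correlation is unchanged by positive rescaling and shifting. The brute-force search grows factorially with the number of features, which is why a scalable solver is needed for thousands of features.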