Background: Missing data are a common challenge in mass spectrometry-based metabolomics and can lead to biased and incomplete analyses. Integrating whole-genome sequencing (WGS) data with metabolomics data has emerged as a promising way to improve imputation accuracy in metabolomics studies. Method: We propose a method that uses WGS data and reference metabolites to impute unknown metabolites. The approach uses a multi-view variational autoencoder to jointly model burden scores, polygenic risk scores (PGS), and linkage disequilibrium (LD)-pruned single nucleotide polymorphisms (SNPs) for feature extraction and imputation of missing metabolomics data. By learning latent representations of both omics data types, the method imputes missing metabolomics values from genomic information. Results: We evaluate the method on empirical metabolomics datasets with missing values and show that it outperforms conventional imputation techniques. Using burden scores, PGS, and LD-pruned SNPs derived from 35 template metabolites, the proposed method achieved R^2 scores > 0.01 for 71.55% of metabolites. Conclusion: Integrating WGS data into metabolomics imputation improves data completeness and strengthens downstream analyses, supporting more comprehensive and accurate investigation of metabolic pathways and disease associations. Our findings illustrate the potential benefits of using WGS data for metabolomics imputation and underscore the importance of multi-modal data integration in precision medicine research.
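To make the Method and Results statements concrete, the following is a minimal sketch of a multi-view variational autoencoder forward pass with a masked reconstruction loss, plus a per-metabolite R^2 summary. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the architecture, so the per-view encoders, fusion by concatenation, layer sizes, toy data dimensions, and all names (MultiViewVAE, masked_elbo_loss, r2_per_metabolite) are hypothetical. The weights are random and untrained, and R^2 is assumed here to be the coefficient of determination.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense(n_in, n_out):
    """Random (untrained) weights and zero bias for one linear layer."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


class MultiViewVAE:
    """Toy multi-view VAE: one encoder per genomic view, one shared latent space.

    Assumption: views are fused by concatenating their hidden representations.
    """

    def __init__(self, view_dims, n_metabolites, latent_dim=8, hidden_dim=16):
        self.encoders = [dense(d, hidden_dim) for d in view_dims]
        fused_dim = hidden_dim * len(view_dims)
        self.mu_layer = dense(fused_dim, latent_dim)
        self.logvar_layer = dense(fused_dim, latent_dim)
        self.decoder = dense(latent_dim, n_metabolites)

    def encode(self, views):
        # Encode each view separately, then concatenate the hidden features.
        hidden = [np.tanh(x @ w + b) for x, (w, b) in zip(views, self.encoders)]
        h = np.concatenate(hidden, axis=1)
        mu = h @ self.mu_layer[0] + self.mu_layer[1]
        logvar = h @ self.logvar_layer[0] + self.logvar_layer[1]
        return mu, logvar

    def forward(self, views):
        mu, logvar = self.encode(views)
        # Reparameterisation: z = mu + sigma * eps, with eps ~ N(0, I).
        z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
        w, b = self.decoder
        return z @ w + b, mu, logvar


def masked_elbo_loss(y, y_hat, observed, mu, logvar):
    """Squared error over observed entries only, plus the Gaussian KL term."""
    y_filled = np.where(observed, y, 0.0)  # neutralise NaNs at missing entries
    recon = (observed * (y_filled - y_hat) ** 2).sum() / max(observed.sum(), 1)
    kl = -0.5 * np.mean(np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1))
    return recon + kl


def r2_per_metabolite(y_true, y_pred):
    """Coefficient of determination for each metabolite (column)."""
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot


# Toy data: 20 samples, three genomic views, 10 target metabolites.
n_samples, n_metabolites = 20, 10
views = [
    rng.normal(size=(n_samples, 35)),                       # burden scores
    rng.normal(size=(n_samples, 35)),                       # PGS
    rng.integers(0, 3, size=(n_samples, 50)).astype(float), # SNP genotypes 0/1/2
]
y = rng.normal(size=(n_samples, n_metabolites))
observed = rng.random((n_samples, n_metabolites)) > 0.3
y_missing = np.where(observed, y, np.nan)

model = MultiViewVAE([v.shape[1] for v in views], n_metabolites)
y_hat, mu, logvar = model.forward(views)
loss = masked_elbo_loss(y_missing, y_hat, observed, mu, logvar)

# Keep observed values and fill only the missing entries with predictions.
imputed = np.where(observed, y_missing, y_hat)

# Share of metabolites whose R^2 exceeds the 0.01 threshold used in the Results.
share_above = float(np.mean(r2_per_metabolite(y, y_hat) > 0.01))
```

Because the weights are untrained and the toy data are random, the resulting loss and share_above carry no meaning; the sketch only shows how the three genomic views could feed one latent space, how missing entries can be excluded from the reconstruction term, and how a summary of the form "R^2 > 0.01 for a given share of metabolites" could be computed.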