Quantitative measurements produced by mass spectrometry proteomics experiments offer a direct way to explore the role of proteins in molecular mechanisms. However, analysis of such data is challenging due to the large proportion of missing values. A common strategy to address this issue is to analyze an imputed dataset, which often introduces systematic bias into downstream analyses if the imputation errors are ignored. In this paper, we propose a statistical framework inspired by doubly robust estimators that offers valid and efficient inference for proteomic data. Our framework combines powerful machine learning tools, such as variational autoencoders, which use high-dimensional peptide data to improve imputation quality, with a parametric model that estimates the propensity score for debiasing imputed outcomes. Our estimator is compatible with the double machine learning framework and has provable properties. In application to both single-cell and bulk-cell proteomic data, our method uses the imputed data to gain additional, meaningful discoveries while maintaining good control of false positives.
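To make the debiasing idea concrete, the following is a minimal sketch of a generic doubly robust (augmented inverse-probability-weighted) mean for one protein with missing values. It is not the estimator proposed in the paper. It assumes the imputed values and the propensity scores (the probability that each value is observed) have already been produced by some upstream models, and the function and argument names are hypothetical.

```python
from typing import Optional, Sequence


def doubly_robust_mean(
    observed: Sequence[Optional[float]],
    imputed: Sequence[float],
    propensity: Sequence[float],
) -> float:
    """Generic AIPW-style mean under missingness (illustrative only).

    observed   -- measured abundance per sample, None where missing
    imputed    -- imputed abundance for every sample (assumed given)
    propensity -- estimated probability that each sample is observed
                  (assumed given, must lie in (0, 1])
    """
    n = len(observed)
    if n == 0:
        raise ValueError("need at least one sample")
    if not (n == len(imputed) == len(propensity)):
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for y, m, p in zip(observed, imputed, propensity):
        if not 0.0 < p <= 1.0:
            raise ValueError("propensity scores must lie in (0, 1]")
        # Start from the imputed value, then correct it with the
        # inverse-probability-weighted residual when the value is observed.
        correction = 0.0 if y is None else (y - m) / p
        total += m + correction
    return total / n


if __name__ == "__main__":
    obs = [1.0, None, 4.0, None]
    imp = [2.0, 2.0, 2.0, 2.0]
    prop = [0.5, 0.5, 0.5, 0.5]
    print(doubly_robust_mean(obs, imp, prop))
```

When the imputed values match the observed ones exactly, the correction term vanishes and the result reduces to the plain mean of the imputed data. When they do not, the weighted residuals pull the estimate back toward what the observed measurements support, which is the sense in which ignoring imputation errors would otherwise bias downstream analysis.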