A Bayesian Framework for Multivariate Differential Analysis accounting for Missing Data

Current statistical methods for differential proteomics analysis generally leave aside several challenges, such as missing values, correlations between peptide intensities, and uncertainty quantification. Moreover, they provide only point estimates, such as the mean intensity of a given peptide or protein in a given condition. The decision of whether an analyte should be considered differential is then based on comparing the p-value to a significance threshold, usually 5%. In the state-of-the-art limma approach, a hierarchical model is used to derive the posterior distribution of the variance estimator for each analyte. The expectation of this distribution is then used as a moderated estimate of the variance and is plugged directly into the expression of the t-statistic. However, instead of relying merely on moderated estimates, we could provide more powerful and intuitive results by adopting a fully Bayesian approach, which allows uncertainty to be quantified. The present work introduces this idea, taking advantage of standard results from Bayesian inference with conjugate priors in hierarchical models to derive a methodology tailored to multiple-imputation contexts. Furthermore, we tackle the more general problem of multivariate differential analysis in order to account for possible inter-peptide correlations. By defining a hierarchical model with prior distributions on both mean and variance parameters, we achieve a global quantification of uncertainty for differential analysis. Inference is then performed by computing the posterior distribution of the difference in mean peptide intensities between two experimental conditions. Although hierarchical structures allow more flexible models, our choice of conjugate priors preserves analytical expressions, so that posterior distributions can be sampled directly without expensive MCMC methods.
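To make the idea of direct posterior sampling concrete, the following is a minimal sketch, not the method of this work: it treats a single peptide (univariate case), assumes complete data (no imputation step), and assumes a textbook Normal-Inverse-Gamma conjugate prior. The function names and the default hyperparameters (`mu0`, `lambda0`, `alpha0`, `beta0`) are illustrative choices, not values taken from the text.

```python
import numpy as np


def nig_posterior(y, mu0=0.0, lambda0=1.0, alpha0=1.0, beta0=1.0):
    """Posterior hyperparameters of a Normal-Inverse-Gamma model.

    Assumed model (illustrative, univariate):
        sigma^2      ~ InvGamma(alpha0, beta0)
        mu | sigma^2 ~ Normal(mu0, sigma^2 / lambda0)
        y_i | mu, sigma^2 ~ Normal(mu, sigma^2)
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    ybar = y.mean()
    lambda_n = lambda0 + n
    mu_n = (lambda0 * mu0 + n * ybar) / lambda_n
    alpha_n = alpha0 + n / 2.0
    beta_n = (
        beta0
        + 0.5 * np.sum((y - ybar) ** 2)
        + lambda0 * n * (ybar - mu0) ** 2 / (2.0 * lambda_n)
    )
    return mu_n, lambda_n, alpha_n, beta_n


def sample_mean(params, size, rng):
    """Draw from the marginal posterior of mu, a shifted and scaled Student-t."""
    mu_n, lambda_n, alpha_n, beta_n = params
    scale = np.sqrt(beta_n / (alpha_n * lambda_n))
    return mu_n + scale * rng.standard_t(2.0 * alpha_n, size=size)


def sample_mean_difference(y_a, y_b, size=10_000, seed=0, **prior):
    """Draws from the posterior of mu_a - mu_b, conditions treated independently."""
    rng = np.random.default_rng(seed)
    draws_a = sample_mean(nig_posterior(y_a, **prior), size, rng)
    draws_b = sample_mean(nig_posterior(y_b, **prior), size, rng)
    return draws_a - draws_b


if __name__ == "__main__":
    # Hypothetical log-intensities for one peptide in two conditions.
    condition_a = [21.3, 21.9, 22.1, 21.6]
    condition_b = [20.1, 20.4, 19.8, 20.6]
    diff = sample_mean_difference(condition_a, condition_b, mu0=20.0)
    low, high = np.quantile(diff, [0.025, 0.975])
    print(f"95% credible interval for the difference: [{low:.2f}, {high:.2f}]")
    print(f"Posterior probability that the difference is positive: {(diff > 0).mean():.3f}")
```

Because the marginal posterior of each mean is available in closed form, every draw is an exact, independent sample and no Markov chain is needed. The output is a full distribution for the difference, from which credible intervals or posterior probabilities can be read directly instead of a single p-value.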
