A common task in high-throughput biology is to screen for associations across thousands of units of interest, e.g., genes or proteins. Often, the data for each unit are modeled as Gaussian measurements with unknown mean and variance and are summarized as per-unit sample averages and sample variances. The downstream goal is multiple testing for the means. In this domain, it is routine to "moderate" (that is, to shrink) the sample variances through parametric empirical Bayes methods before computing p-values for the means. Such an approach is asymmetric in that a prior is posited and estimated for the nuisance parameters (variances) but not the primary parameters (means). Our work initiates the formal study of this paradigm, which we term "empirical partially Bayes multiple testing." In this framework, if the prior for the variances were known, one could proceed by computing p-values conditional on the sample variances, a strategy that Sir David Cox called partially Bayes inference. We show that these conditional p-values satisfy an Eddington/Tweedie-type formula and are approximated at nearly parametric rates when the prior is estimated by nonparametric maximum likelihood. The estimated p-values can be used with the Benjamini-Hochberg procedure to guarantee asymptotic control of the false discovery rate. Even in the compound setting, wherein the variances are fixed, the approach retains asymptotic type-I error guarantees.
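To make the pipeline concrete, the following is a minimal sketch under stated assumptions, not the implementation from this work. It assumes that each unit contributes a statistic z that is N(mu, sigma^2) and an independent sample variance s2 distributed as sigma^2 * chi2_nu / nu, that the prior on sigma^2 is a discrete distribution on a grid that has already been supplied (for instance as the output of a nonparametric maximum likelihood fit, which is not shown), and that the conditional p-value for H0: mu = 0 is the posterior average of the two-sided normal tail probability given s2. The function names `log_lik_s2`, `conditional_pvalue`, and `benjamini_hochberg`, and all numbers in the usage lines, are hypothetical.

```python
import math


def log_lik_s2(s2, sigma2, nu):
    """Log density of S^2 at s2, assuming S^2 ~ sigma2 * chi2_nu / nu."""
    half = nu / 2.0
    return (half * math.log(nu / (2.0 * sigma2))
            + (half - 1.0) * math.log(s2)
            - nu * s2 / (2.0 * sigma2)
            - math.lgamma(half))


def conditional_pvalue(z, s2, nu, grid, weights):
    """Two-sided p-value for H0: mu = 0, conditional on the sample variance s2.

    grid holds candidate values of sigma^2 and weights their prior masses
    (an assumed discrete prior). The result averages the normal tail
    probability over the posterior of sigma^2 given s2.
    """
    support = [(v, w) for v, w in zip(grid, weights) if w > 0]
    logs = [math.log(w) + log_lik_s2(s2, v, nu) for v, w in support]
    top = max(logs)  # subtract the maximum for numerical stability
    post = [math.exp(value - top) for value in logs]
    total = sum(post)
    # 2 * (1 - Phi(|z| / sigma)) equals erfc(|z| / sqrt(2 * sigma^2))
    return sum(p / total * math.erfc(abs(z) / math.sqrt(2.0 * v))
               for p, (v, _) in zip(post, support))


def benjamini_hochberg(pvalues, alpha):
    """Return a list of booleans marking the hypotheses rejected by BH."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose sorted p-value is below alpha * rank / m
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject


# Illustrative usage with made-up inputs (not data from the paper).
grid = [0.5, 1.0, 2.0]       # assumed support points for sigma^2
weights = [0.3, 0.4, 0.3]    # assumed prior masses
nu = 4                       # degrees of freedom of each sample variance
z_values = [3.5, 0.2, -2.8, 1.0]
s2_values = [0.8, 1.5, 0.6, 2.2]
pvals = [conditional_pvalue(z, s2, nu, grid, weights)
         for z, s2 in zip(z_values, s2_values)]
rejections = benjamini_hochberg(pvals, alpha=0.1)
```

With a prior concentrated on a single variance, `conditional_pvalue` reduces to the ordinary two-sided z-test p-value; a diffuse prior lets the observed sample variance decide how much weight each candidate variance receives.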